Barcelona’s Digital Landscape: a data-driven exploration of urban dynamics around Sagrada Familia. AI-generated by our team.
A tourism company based in Zürich, Switzerland, has observed a significant increase in travel demand to Barcelona in recent years. Indeed, Barcelona ranks as the third most in-demand city for Airbnb rentals in Europe, behind Paris and London.
Consequently, the company’s manager has requested a Machine Learning study and analysis of Airbnb accommodations in the city. The goal is to understand price behavior and identify the factors influencing accommodation costs and occupancy, enabling the company to provide optimal responses to clients’ inquiries.
To achieve this goal, the team has decided to analyse and address three questions to provide comprehensive insights for the manager.
Barcelona is one of the most visited cities in Europe, and the rise of Airbnb and other short-term rental platforms has led to a notable increase in tourism. However, this growth also presents challenges for accommodation businesses and the local housing market. A study available on the Social Science Research Network (SSRN, link: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3428237) found that rental costs in neighborhoods with high Airbnb activity increased by 7% between 2009 and 2016. This is primarily because property owners, motivated by tourist demand for short-term rentals, frequently opt to let their properties at higher short-term rates rather than committing to long-term leases.
For these reasons, it is crucial for tourism companies to understand this dynamic market to remain competitive and provide tailored services to their clients.
The Zürich-based tourism company needs reliable data on Airbnb prices and occupancy rates to make data-driven recommendations and stay ahead of competitors.
This analysis is for educational purposes only. The findings are based on public data and are not professional advice. The results should not be used for business or policy decisions.
To conduct the study, the team has decided to analyse a dataset of Barcelona Airbnb listings available on the Kaggle website (link: https://www.kaggle.com/datasets/fermatsavant/airbnb-dataset-of-barcelona-city).
The dataset consists of 19,833 observations across 25 variables, including geographical zones, amenities, prices, and accommodations.
- X: numerical index for rows
- id: unique identifier for listings
- host_id: unique identifier for hosts
- host_listings_count: number of listings by the host
- latitude: geographic latitude of the listing
- longitude: geographic longitude of the listing
- accommodates: number of guests the listing can accommodate
- bathrooms: number of bathrooms in the listing
- bedrooms: number of bedrooms in the listing
- beds: number of beds in the listing
- minimum_nights: minimum number of nights required for booking
- availability_30: number of available nights in the next 30 days
- availability_60: number of available nights in the next 60 days
- availability_90: number of available nights in the next 90 days
- availability_365: number of available nights in the next 365 days
- number_of_reviews_ltm: number of reviews in the last 12 months
- review_scores_rating: average review rating score
- host_is_superhost: indicates if the host is a superhost ("t" or "f")
- has_availability: indicates if the listing is available for booking ("t" or "f")
- neighbourhood: name of the neighbourhood where the listing is located
- zipcode: postal code of the listing
- property_type: type of property (e.g., “Apartment”)
- room_type: type of room (e.g., “Entire home/apt”)
- amenities: list of amenities provided in the listing
- price: price of the listing as a string (e.g., “$130.00”)

As can be seen, the variable ‘price’ has a ‘Character’ data type. Therefore, in chapter 3, this field will be transformed into an integer variable to enable the necessary calculations.
To streamline the calculations and analysis, a sub-dataset of 10,000 randomly selected observations is created in the following steps. Additionally, a seed is set to ensure that the same observations are retained throughout the analysis.
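A minimal sketch of such a reproducible subsample is shown here. The data frame names `airbnb_full` and `airbnb_sample`, the toy contents, and the seed value are all hypothetical; the report does not state the seed used for this step.

```r
# Toy stand-in for the full Kaggle table (19,833 rows in the report).
airbnb_full <- data.frame(id = seq_len(19833))

# Fixing the seed makes the random draw repeatable across runs.
set.seed(123)  # hypothetical value, not taken from the report

# Draw 10,000 distinct row indices (sampling without replacement).
sample_rows <- sample(nrow(airbnb_full), size = 10000)
airbnb_sample <- airbnb_full[sample_rows, , drop = FALSE]

nrow(airbnb_sample)
```

Re-running the block with the same seed returns the same 10,000 rows, which is what keeps later chapters consistent with each other.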
To address the research question, the study will be divided into three parts. First, an Exploratory Data Analysis (EDA) will be conducted to gain a deeper understanding of the data. Second, Machine Learning models will be implemented, and their performance will be evaluated to identify the best-performing model. Finally, the selected model will be used to provide the most accurate answer to the research question posed by the team.
The different models to be developed are:
First, the pricing variable is converted into a numeric format, and in the fifth chapter of this report (Machine Learning Models), the categorical variables will be transformed into factors for further analysis and modeling.
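One way the price conversion could look is sketched below. The example strings are illustrative, and the choice to strip both the dollar sign and thousands separators is an assumption about how the raw `price` column is formatted beyond the “$130.00” example given in the variable list.

```r
# Illustrative raw price strings in the format described for the dataset.
price_raw <- c("$130.00", "$1,250.00", "$45.00")

# Remove the currency symbol and any thousands separator, then convert.
# Inside a character class, "$" is matched literally.
price_num <- as.numeric(gsub("[$,]", "", price_raw))

price_num
```

If a whole-number type is required, `as.integer(round(price_num))` could be applied afterwards.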
We inspect the dataset to get an idea of possible missing values, their number, and their distribution.
| Variable | Missing_Count | Missing_Percent |
|---|---|---|
| review_scores_rating | 2415 | 24.15% |
| host_listings_count | 22 | 0.22% |
| beds | 18 | 0.18% |
| bathrooms | 6 | 0.06% |
| bedrooms | 2 | 0.02% |
host_listings_count : since it is not possible to make any calculation on the number of listings of the host when this value is missing, we exclude the 22 rows that lack it.
bathrooms : the number of bathrooms is missing in 6 rows.
beds : the number of beds is not specified for 18 listings.
bedrooms : 2 rows contain missing values and can be deleted.
review_scores_rating : the review score rating is missing in 2,415 of the 10,000 rows. This is a substantial share, around 24% of the selected data. In this case we decide to impute the missing values by replacing them with 0.
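The missing-value summary and the handling described above can be sketched as follows. The toy data frame and the object names `missing_tbl` and `cleaned` are hypothetical, and only two structural columns are used for the row exclusion; this is an illustration of the approach, not the report's actual code.

```r
# Toy data with gaps in three of the columns discussed above.
toy <- data.frame(
  host_listings_count  = c(1, NA, 3, 2, 5),
  bedrooms             = c(1, 2, NA, 1, 3),
  review_scores_rating = c(95, NA, 88, NA, 90)
)

# Count and percentage of missing values per column, largest first.
missing_count <- colSums(is.na(toy))
missing_tbl <- data.frame(
  Missing_Count   = missing_count,
  Missing_Percent = round(100 * missing_count / nrow(toy), 2)
)
missing_tbl <- missing_tbl[order(-missing_tbl$Missing_Count), ]

# Exclude rows lacking a structural field, then impute missing ratings with 0.
keep <- complete.cases(toy[, c("host_listings_count", "bedrooms")])
cleaned <- toy[keep, ]
cleaned$review_scores_rating[is.na(cleaned$review_scores_rating)] <- 0

missing_tbl
cleaned
```

Note that imputing 0 places unrated listings at the bottom of the rating scale, so any model using this column treats "no rating" as "worst rating".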
From the correlation matrix it is possible to deduce that there is almost no correlation between the availability periods and the number of beds, bedrooms, and bathrooms. This suggests that the availability of the properties does not depend on those features, but more probably on location and facilities. We can also observe a positive correlation among the number of bedrooms, beds, and bathrooms, as well as a positive correlation among the different availability periods.
Since the price variable is a key focus of our analysis, an outlier analysis of this variable has been conducted.
Out of a total of 10,000 values, 804 (8.04%) are identified as outliers. Below, a boxplot is presented to visualize the median and the outlier observations.
From the boxplot above, it can be concluded that the median price is approximately 65€ per night, with 50% of the observations concentrated between 40€ (25th percentile) and 112€ (75th percentile), representing the interquartile range (IQR).
Additionally, the presence of numerous outliers extending to the right indicates a right-skewed distribution, meaning higher prices are influencing the dataset.
The significant number of observations with higher prices could suggest the presence of many luxury properties. Therefore, further analysis is required to identify the factors influencing these price variations.
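The outlier count can be illustrated with the usual boxplot rule, which flags values more than 1.5 times the IQR beyond the quartiles. That this is the rule behind the 804 flagged observations is an assumption, since the report does not name the criterion; the prices below are made up.

```r
# Made-up nightly prices with two extreme values.
prices <- c(35, 40, 50, 65, 80, 110, 120, 900, 8000)

# Quartiles and interquartile range.
q <- unname(quantile(prices, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

is_outlier <- prices < lower_fence | prices > upper_fence
sum(is_outlier)            # number of flagged prices
100 * mean(is_outlier)     # share of flagged prices in percent
```

`boxplot()` itself uses hinges rather than `quantile()`, so counts can differ slightly on real data.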
| neighbourhood | property_type | bedrooms | price |
|---|---|---|---|
| Gràcia | Bed and breakfast | 1 | 8000 |
| Sants-Montjuïc | Boat | 4 | 8000 |
| Vila de Gràcia | Bed and breakfast | 1 | 8000 |
| Vila de Gràcia | Bed and breakfast | 1 | 8000 |
| Vila de Gràcia | Bed and breakfast | 1 | 8000 |
| Vila de Gràcia | Bed and breakfast | 1 | 8000 |
| Eixample | Boutique hotel | 1 | 6000 |
| Eixample | Hotel | 1 | 6000 |
| Eixample | Hotel | 1 | 6000 |
| Eixample | Hotel | 1 | 6000 |
| Eixample | Hotel | 1 | 6000 |
| Eixample | Hotel | 1 | 6000 |
| La Nova Esquerra de l’Eixample | Hotel | 1 | 6000 |
| La Nova Esquerra de l’Eixample | Hotel | 1 | 6000 |
| La Nova Esquerra de l’Eixample | Hotel | 1 | 6000 |
| La Nova Esquerra de l’Eixample | Hotel | 1 | 6000 |
| Sant Antoni | Boutique hotel | 1 | 6000 |
| Sant Antoni | Boutique hotel | 1 | 6000 |
| Sant Antoni | Hotel | 1 | 6000 |
| Sant Antoni | Hotel | 1 | 6000 |
It seems that the price variable may contain erroneous entries. For further analysis, research revealed that the average nightly rate for an Airbnb in Barcelona is €93 (according to Hostel Geeks, link: https://hostelgeeks.com/best-airbnbs-in-barcelona-spain/). Therefore, prices of €8,000 are likely errors. As a result, it was decided to exclude prices above €1,000 from the analysis.
The new summary for the Price variable is the following:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 40.00 65.00 96.93 110.00 1000.00
Now, a new Correlation Matrix with the filtered observations of the price variable is displayed.
We can observe that the price variable is now most correlated with variables related to the size and capacity of an Airbnb, such as bathrooms, bedrooms, number of beds, and accommodates.
On the other hand, general availability and review scores have little impact on the price.
From the histograms above, several variables exhibit right-skewed distributions, including price, minimum_nights, and number_of_reviews_ltm.
On the other hand, the data suggests that in Barcelona, Airbnb listings are primarily designed for small groups of people seeking short-term stays. Additionally, these accommodations tend to receive high review scores, indicating good guest satisfaction with the different properties.
Almost 19% of the hosts offering an Airbnb in Barcelona are categorized as Superhosts. This means tourists can find accommodations in the city where hosts go above and beyond to provide excellent hospitality. This insight could be a key factor in explaining the higher price values observed in certain neighborhoods.
This section presents a variety of plots for the categorical variables, including Property Type, Room Type, Top Neighbourhoods, and Amenities.
From the plot above, it can be observed that apartments dominate the Airbnb market in Barcelona, accounting for 86% of the listings.
On the other hand, the low availability of luxury or specialized accommodations, such as Boutique Hotels (0.5%), Guest Suites (0.7%), and Lofts (2.4%), suggests that these property types cater to a niche market. Travelers opting for these accommodations are likely visiting Barcelona for specific reasons, such as work or unique travel experiences.
Most listings are split between the Entire home/Apartment and Private Room types.
Less than 1% of the hosts offer Shared Room, which suggests that travelers prefer more privacy during the stay.
The Eixample district of Barcelona represents the most popular neighbourhood on Airbnb, with 27% of the total listings. This is followed by Ciutat Vella, which accounts for 18.8% of the listings.
Eixample is situated close to the historic centre of the city and is more centrally located than other neighbourhoods. The area offers many attractions for tourists, including La Sagrada Familia, Casa Batlló, and Passeig de Gràcia. Together with its excellent transport connections, this makes Eixample an ideal destination for visitors.
On the other hand, Ciutat Vella is the oldest part of Barcelona and serves as the heart of the city, known for its historical charm and vibrant cultural scene.
Given this, the Tourism Company in Zürich could recommend that its clients focus on these neighbourhoods to attract more customers and enhance their travel experience.
The Wordcloud above shows the most common amenities offered by the different hosts.
The most prominent amenities are: Kitchen, Wifi, Heating, Washer and Hair dryer. This can be taken to indicate that tourists may consider a place comfortable for their stay if it meets these basic requirements.
This section was designed to allow the employees of the Tourism Company and other Users to interact with the data on neighborhoods, prices, and reviews.
The purpose of the following interactive plot is to allow users to select a neighborhood of interest and visualize, on a map, the different accommodations available along with their price per night when one of the circles is clicked.
In the heatmap below, users can observe the zones with higher accommodation prices (red/orange areas).
In contrast, the zones colored in green or blue represent lower-priced neighborhoods.
According to the heatmap, the Tourism Company can recommend the red zones to tourists looking for more centralized accommodations, regardless of price. On the other hand, tourists who want to save money can be advised to choose accommodations in the green or blue areas, which are typically farther from the city center.
From the plot above, the following insights can be derived:
- The majority of listings are concentrated at the lower price range (below 250 Euros), irrespective of room type.
- Accommodations with high review scores (exceeding 90 points) are distributed across all price categories, indicating that well-reviewed Airbnbs are not restricted to a particular room type or price range.
In this chapter, different machine learning models will be explored to predict Airbnb prices and the occupancy rate over the next 30 days.
The formula to calculate the Occupancy rate in 30 days is:
\[ \text{Occupancy Rate} = \left(1 - \frac{\text{Availability 30 Days}}{30}\right) \times 100 \]
According to the formula above, a new column with the Occupancy rate is calculated.
This new variable is the response that the occupancy models will predict for the next 30 days.
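As a small illustration of the occupancy formula, assuming `availability_30` holds the number of available nights out of the next 30, the new column could be computed as below. The vector values are invented; the name `occupancy_rate_30` follows the response variable used later in the report.

```r
# Invented availability values: nights still bookable in the next 30 days.
availability_30 <- c(0, 15, 30, 6)

# Occupancy is the share of nights that are NOT available, in percent.
occupancy_rate_30 <- (1 - availability_30 / 30) * 100

occupancy_rate_30
```

A listing with no free nights gets 100, and a fully available listing gets 0. The measure treats every unavailable night as booked, so nights blocked by the host also count as occupancy.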
As mentioned in the previous chapters, the categorical variables are converted into factors to proceed with the modeling phase. These are: neighbourhood, property_type, room_type, availability_30, zipcode. Moreover, variables that represent counts or continuous numbers are converted into numeric: accommodates, bathrooms, bedrooms, beds, latitude, longitude, review_scores_rating and minimum_nights.
Before analysing the different models, we need to divide the data into a training set and a test set. The first set will be used to find the relationship between the dependent and independent variables, while the second set will be used to analyse the performance of the models. We use about 60% of the data set as the training set and the rest as the test set. We also remove rows with NA values from the test set to avoid problems when evaluating the models’ predictions.
set.seed(1000)
# Define the number of groups and the amount of sample for each
group <- sample(2, nrow(BCN_Accomm),
                replace = TRUE,
                prob = c(0.6, 0.4))
# training data set with around 60% of the samples
train <- BCN_Accomm[group==1,]
# test data set with around 40% of the samples
test <- BCN_Accomm[group==2,]
test <- na.omit(test) # removing rows with NAs values might reduce the size of the test set
We ensure that all factor variables have the same levels in the train and test sets, and run a final check that no NA values remain in the numerical variables.
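One possible way to align factor levels between the two sets is sketched here. The approach (dropping test rows whose level never occurs in training, then reusing the training levels) is an assumption; the report does not show how the alignment was done, and the toy frames are invented.

```r
# Toy train and test sets with a factor whose levels differ.
train <- data.frame(
  room_type = factor(c("Entire home/apt", "Private room", "Private room"))
)
test <- data.frame(
  room_type = factor(c("Private room", "Shared room"))
)

# A model fitted on train cannot predict for a level it has never seen,
# so drop those test rows and rebuild the factor with the training levels.
seen <- test$room_type %in% levels(train$room_type)
test <- test[seen, , drop = FALSE]
test$room_type <- factor(test$room_type, levels = levels(train$room_type))

levels(test$room_type)
```

Without this step, `predict()` on an `lm` object stops with an error when the test set contains a new factor level.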
At this point, we also prepare normalized versions of the train and test sets for use in some of the models, naming them train_normalized and test_normalized respectively.
Moreover, for some of the models the categorical variable neighbourhood needs to be encoded into numerical values. We apply one-hot encoding so that each unique neighbourhood becomes a separate binary feature.
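Both preparation steps can be sketched in base R. Min-max scaling is assumed here because the report does not say which normalization was applied, and the toy values are invented; only the names `train_normalized` and `test_normalized` come from the text.

```r
# Toy training and test data.
train <- data.frame(
  price = c(40, 65, 110, 200),
  neighbourhood = factor(c("Eixample", "Ciutat Vella", "Eixample", "Sant Antoni"))
)
test <- data.frame(price = c(50, 250))

# Min-max scaling, with the range taken from the training set only
# so that no information from the test set leaks into the scaling.
rng <- range(train$price)
train_normalized <- train
train_normalized$price <- (train$price - rng[1]) / (rng[2] - rng[1])
test_normalized <- test
test_normalized$price <- (test$price - rng[1]) / (rng[2] - rng[1])

# One-hot encoding: "- 1" drops the intercept so every level gets a column.
one_hot <- model.matrix(~ neighbourhood - 1, data = train)

colnames(one_hot)
```

Test values outside the training range fall outside [0, 1] after scaling, as the second test price does here.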
Variables for Pricing Model
Next, based on the Correlation Matrix, the variables used to address the first research question about price are:
- bedrooms
- bathrooms
- accommodates
- beds
- latitude and longitude
- review_scores_rating
- minimum_nights
- property_type
- room_type
- neighbourhood

Variables for Occupancy Rate Model
The variables used for the Occupancy rate in one month are:
- latitude and longitude (location)
- bathrooms
- bedrooms
- accommodates
- beds
- price
- minimum_nights
- review_scores_rating
- neighbourhood

Before fitting the models, it is good practice to get an overview of the relationships between the response and the predictors. This analysis will also support the choice of distributions and parameters in the different models (i.e. which kernel for SVM, the family distribution in GLMs, …).
In a linear model, the response is a continuous variable whose errors are assumed to follow a normal distribution. To answer question 1, What are the key factors influencing accommodation prices in Barcelona?, the response variable in the model is the price of accommodation in Barcelona. By fitting the linear model and analysing the linear relations between the response variable and the predictors, we identify which predictors might have an effect on price and interpret them.
Specifications. The chosen variables, the relations among them, and their correlations will be studied directly with the model. For this model, the variable availability_30 is treated as numeric.
To answer question 2, Can we predict occupancy rates based on location, amenities, or other factors?, a baseline linear model is fitted with the variables selected in the EDA and then analysed. The response variable is the calculated variable occupancy_rate_30.
As first step, a baseline linear model is performed with original dataset numeric predictors:
##
## Call:
## lm(formula = price ~ host_listings_count + latitude + longitude +
## accommodates + bathrooms + bedrooms + beds + minimum_nights +
## availability_30 + availability_60 + availability_90 + review_scores_rating +
## availability_365 + number_of_reviews_ltm, data = BCN_Accomm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -442.07 -35.91 -13.01 10.65 942.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.991e+03 2.671e+03 3.741 0.000185 ***
## host_listings_count 1.197e-01 1.723e-02 6.947 3.96e-12 ***
## latitude -2.519e+02 6.537e+01 -3.854 0.000117 ***
## longitude 1.988e+02 5.429e+01 3.661 0.000252 ***
## accommodates 2.023e+01 9.148e-01 22.112 < 2e-16 ***
## bathrooms 1.077e+01 1.763e+00 6.111 1.03e-09 ***
## bedrooms 9.494e+00 1.718e+00 5.526 3.35e-08 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91.81 on 9860 degrees of freedom
## Multiple R-squared: 0.2946, Adjusted R-squared: 0.2936
## F-statistic: 294.1 on 14 and 9860 DF, p-value: < 2.2e-16
Only four predictors appear to have no significant effect on price, while the rest seem to influence the response variable. However, the linear regression model explains only about 29% of the variability in accommodation prices in Barcelona (Adjusted R-squared). The residual standard error (RSE), a measure of the typical deviation of the observed values from those predicted by the regression model, is around 91, which is relatively high given the range of prices.
With these results, and given that the response variable is positive throughout its entire range, applying a logarithmic transformation to the response variable could help improve the model’s performance.
The following boxplots compare price with the transformed response variable, log(price).
As observed in the boxplots, the distribution of price is highly skewed, whereas the log transformation compresses the range and reduces the impact of extreme values. The resulting distribution appears more symmetrical, which is desirable for linear regression, as it can improve model robustness and provide a better fit.
The following boxplot visualizes the relationship between the number of guests (accommodates) and the log-transformed price. This helps to identify how the price distribution varies across different accommodation capacities.
We check the performance of the model with the log-transformed response variable price (the code is omitted in this output).
##
## Call:
## lm(formula = log(price) ~ host_listings_count + latitude + longitude +
## accommodates + bathrooms + bedrooms + beds + minimum_nights +
## availability_30 + availability_60 + availability_90 + review_scores_rating +
## availability_365 + number_of_reviews_ltm, data = BCN_Accomm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4461 -0.3541 -0.0410 0.3110 4.1241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.815e+02 1.688e+01 10.750 < 2e-16 ***
## host_listings_count 1.041e-03 1.089e-04 9.559 < 2e-16 ***
## latitude -4.404e+00 4.132e-01 -10.657 < 2e-16 ***
## longitude 1.945e+00 3.432e-01 5.666 1.50e-08 ***
## accommodates 2.404e-01 5.784e-03 41.572 < 2e-16 ***
## bathrooms -3.507e-02 1.115e-02 -3.147 0.00166 **
## bedrooms 4.848e-02 1.086e-02 4.464 8.15e-06 ***
## beds -5.438e-02 6.406e-03 -8.489 < 2e-16 ***
## minimum_nights -4.853e-03 3.399e-04 -14.276 < 2e-16 ***
## availability_30 7.701e-03 1.673e-03 4.604 4.20e-06 ***
## availability_60 -3.657e-05 1.452e-03 -0.025 0.97990
## availability_90 2.905e-04 7.221e-04 0.402 0.68753
## review_scores_rating -4.173e-04 1.645e-04 -2.537 0.01119 *
## availability_365 3.033e-04 5.703e-05 5.318 1.07e-07 ***
## number_of_reviews_ltm -6.436e-04 3.658e-04 -1.759 0.07853 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5804 on 9860 degrees of freedom
## Multiple R-squared: 0.4437, Adjusted R-squared: 0.443
## F-statistic: 561.8 on 14 and 9860 DF, p-value: < 2.2e-16
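In model lm.BCN_Accomm.total_num.transf, the adjusted R-squared improves with the log transformation of price, rising from about 0.29 to 0.44. The RSE is now expressed on the log scale, so it is not directly comparable with the previous value. In general, all the predictors seem to have an impact on the response variable apart from a few, such as availability_60 and availability_90. These will be removed from the model.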
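The effect of the log transformation can be sketched on simulated data. Everything in this block is invented for illustration (the data-generating process, the coefficient 0.2, the noise level); it only mirrors the form of the report's models with a single predictor.

```r
set.seed(1)
n <- 500

# Simulated listings: price grows multiplicatively with capacity,
# which produces a right-skewed price distribution.
accommodates <- sample(1:8, n, replace = TRUE)
price <- exp(3.5 + 0.2 * accommodates + rnorm(n, sd = 0.5))

# Same predictor, raw versus log-transformed response.
fit_raw <- lm(price ~ accommodates)
fit_log <- lm(log(price) ~ accommodates)

# Adjusted R-squared of both fits, side by side.
c(raw = summary(fit_raw)$adj.r.squared,
  log = summary(fit_log)$adj.r.squared)

# The log model recovers the multiplicative effect used in the simulation.
coef(fit_log)["accommodates"]
```

Residual standard errors of the two fits are on different scales (price units versus log units), so only the R-squared values are compared here.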
Inspecting the correlation of bedrooms with other similar variables reveals a certain degree of redundancy, as these variables likely capture similar information about accommodation capacity.
A GVIF greater than 5 suggests multicollinearity with other variables. The availability_* variables exceed this threshold, meaning they are correlated with each other or with other predictors in the model. availability_60 and availability_90 will be removed from the model.
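For numeric predictors, the variance inflation factor can be computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below uses simulated nested availability windows; the report's GVIF values presumably come from a packaged function such as `car::vif`, which is an assumption, and all data here are invented.

```r
set.seed(42)
n <- 300

# Nested availability windows are correlated by construction:
# the 60-day count contains the 30-day count, and so on.
availability_30 <- sample(0:30, n, replace = TRUE)
availability_60 <- availability_30 + sample(0:30, n, replace = TRUE)
availability_90 <- availability_60 + sample(0:30, n, replace = TRUE)
accommodates    <- sample(1:8, n, replace = TRUE)
X <- data.frame(availability_30, availability_60, availability_90, accommodates)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest.
vif_manual <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  1 / (1 - summary(lm(f, data = X))$r.squared)
})

round(vif_manual, 2)
```

The unrelated predictor stays close to 1, while the overlapping availability windows inflate each other.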
We remove from the model those predictors that seem to have no effect on the response variable or that show multicollinearity. The full output of the resulting model lm.BCN_Accomm.total_num.0 is not shown, as it is an intermediate step in the modeling process. The last part of the summary shows:
## [1] "Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [2] ""
## [3] "Residual standard error: 0.5809 on 9863 degrees of freedom"
## [4] "Multiple R-squared: 0.4426,\tAdjusted R-squared: 0.442 "
## [5] "F-statistic: 712 on 11 and 9863 DF, p-value: < 2.2e-16"
## [6] ""
In this model the VIF values are all within acceptable ranges regarding multicollinearity. The model explains a substantial proportion of the variability in the log-transformed accommodation price, with all predictors showing varying levels of effect on the response variable. This linear model, lm.BCN_Accomm.total_num.0, will later be merged with the categorical variables.
Some brief reasons why the categorical variables are or are not included in the linear model (besides the EDA reasons):
X, id, and host_id are identifying variables and will not be considered relevant predictors. Zipcode gives information about location, but it was excluded because latitude and longitude provide more precise localization information, making zipcode redundant. However, neighbourhood will be included initially (EDA reasons).
The original layout of amenities in the dataset is too complex to be treated directly as a factor.
Let's focus on the remaining factors.
has_availability has only one level and is not taken into consideration for the model.
To evaluate the influence of the rest categorical variables on the log-transformed accommodation price, boxplots were generated for host_is_superhost, neighbourhood, property_type, and room_type. The boxplots provide a preliminary overview of how these factors might impact the response variable.
A first model, called lm_categorical, is fitted with only categorical variables. This model includes host_is_superhost, neighbourhood, property_type, and room_type as predictors. To test the effect of a categorical variable with more than two levels, the drop1() function must be used, because it tests all levels of the factor jointly. Furthermore, results obtained with drop1() are unaffected by the ordering of the predictors. The results indicate that all the included factors have a significant effect on the log-transformed price.
For brevity, the detailed output of lm_categorical and the drop1() results are not displayed, as this model will be merged with numerical variables in subsequent steps to create a more comprehensive model.
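To show what such a test looks like, here is a hedged sketch of `drop1()` with an F test on simulated data. The variable names echo the report, but the data, effect sizes, and model are invented and far smaller than lm_categorical.

```r
set.seed(7)
n <- 200

# Simulated three-level factor and one numeric predictor.
room_type <- factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                           n, replace = TRUE))
accommodates <- sample(1:6, n, replace = TRUE)

# Invented level effects on the log price scale.
effect <- c("Entire home/apt" = 0, "Private room" = -0.6, "Shared room" = -1.0)
log_price <- 4 + 0.15 * accommodates +
  unname(effect[as.character(room_type)]) + rnorm(n, sd = 0.3)

fit <- lm(log_price ~ accommodates + room_type)

# One F test per term: room_type is tested as a whole (2 degrees of freedom)
# rather than level by level as in the summary() coefficient table.
d1 <- drop1(fit, test = "F")
d1
```

Each row compares the full model with the model lacking that single term, which is why the result does not depend on the order in which terms were entered.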
Let's add these four factors to the previously chosen linear model with only continuous variables (5.2.2). The analysis results in the following linear model, lm_BCN_Accomm:
## Single term deletions
##
## Model:
## log(price) ~ host_listings_count + latitude + longitude + accommodates +
## bathrooms + beds + minimum_nights + availability_30 + review_scores_rating +
## availability_365 + number_of_reviews_ltm + host_is_superhost +
## neighbourhood + property_type + room_type
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2579.2 -13048
## host_listings_count 1 3.62 2582.8 -13036 13.6981 0.0002159 ***
## latitude 1 5.97 2585.1 -13027 22.5957 2.028e-06 ***
## longitude 1 5.40 2584.6 -13029 20.4387 6.230e-06 ***
## accommodates 1 176.54 2755.7 -12396 668.7323 < 2.2e-16 ***
## bathrooms 1 1.75 2580.9 -13043 6.6387 0.0099932 **
## beds 1 2.82 2582.0 -13039 10.6692 0.0010931 **
## minimum_nights 1 139.38 2718.6 -12530 527.9837 < 2.2e-16 ***
## availability_30 1 76.20 2655.4 -12762 288.6452 < 2.2e-16 ***
## review_scores_rating 1 0.97 2580.1 -13046 3.6653 0.0555858 .
## availability_365 1 0.17 2579.3 -13049 0.6579 0.4173379
## number_of_reviews_ltm 1 3.54 2582.7 -13036 13.4111 0.0002515 ***
## host_is_superhost 1 18.19 2597.3 -12980 68.8865 < 2.2e-16 ***
## neighbourhood 65 107.79 2687.0 -12773 6.2818 < 2.2e-16 ***
## property_type 25 168.71 2747.9 -12472 25.5628 < 2.2e-16 ***
## room_type 2 378.84 2958.0 -11698 717.5364 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [2] ""
## [3] "Residual standard error: 0.5138 on 9770 degrees of freedom"
## [4] "Multiple R-squared: 0.5681,\tAdjusted R-squared: 0.5635 "
## [5] "F-statistic: 123.6 on 104 and 9770 DF, p-value: < 2.2e-16"
## [6] ""
The model summary indicates that the residuals are reasonably small. The R-squared and adjusted R-squared values show that a substantial portion of the variability in the log-transformed price is explained by the predictors in the model. availability_365 does not seem to affect the response variable and will be removed. review_scores_rating is only marginally significant according to its p-value, but it is retained for practical reasons, as guest ratings might be an important factor in the accommodation market, and its performance will be checked in later steps. The remaining predictors and factors seem to have an impact on the response variable.
Multicollinearity:
The results of the multicollinearity analysis indicate that most predictors fall within an acceptable GVIF range (lower than 5). However, the neighbourhood factor has a higher GVIF value. There might also be some collinearity between beds and accommodates; since beds seems to have less impact on the response variable than accommodates, beds will be removed and the collinearity checked again afterwards.
We re-fit the linear model with these findings (dropping availability_365, beds, and neighbourhood) as lm_BCN_Accomm.1.
We also check the collinearity (code omitted here).
Removing neighbourhood significantly reduces the GVIF for latitude and longitude, bringing their values closer to 1 and improving the model’s stability. Similarly, the GVIF for accommodates decreases to an acceptable value. The interpretation of the coefficients is explained further on.
The following plots give a visual impression of some interactions among the variables.
Different combinations of interactions were then tested, for example:
Compared with the model lm_BCN_Accomm.1, the difference in adjusted R-squared is minimal, suggesting that the interaction terms add very little explanatory power to the model. The difference in RSE relative to the model with interactions is also very small, which indicates only a marginal improvement. These results suggest that the interaction terms may have a detectable effect on the response variable, even though their contribution to the overall model fit is small. The improvement from adding interactions might not justify the additional complexity.
None of the further combinations tried improved the results significantly. Therefore, lm_BCN_Accomm.1 is retained as the proposed linear model, for the reasons shown during the fitting.
## Single term deletions
##
## Model:
## log(price) ~ host_listings_count + latitude + longitude + accommodates +
## bathrooms + minimum_nights + availability_30 + review_scores_rating +
## number_of_reviews_ltm + host_is_superhost + property_type +
## room_type
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2692.8 -12756
## host_listings_count 1 5.60 2698.4 -12738 20.4497 6.194e-06 ***
## latitude 1 24.27 2717.0 -12669 88.6455 < 2.2e-16 ***
## longitude 1 10.14 2702.9 -12721 37.0266 1.209e-09 ***
## accommodates 1 328.76 3021.5 -11620 1200.9848 < 2.2e-16 ***
## bathrooms 1 4.39 2697.2 -12742 16.0295 6.282e-05 ***
## minimum_nights 1 157.29 2850.1 -12197 574.6169 < 2.2e-16 ***
## availability_30 1 90.09 2782.8 -12433 329.1256 < 2.2e-16 ***
## review_scores_rating 1 0.68 2693.4 -12756 2.4932 0.1143742
## number_of_reviews_ltm 1 3.09 2695.8 -12747 11.2752 0.0007885 ***
## host_is_superhost 1 21.53 2714.3 -12679 78.6482 < 2.2e-16 ***
## property_type 25 190.91 2883.7 -12130 27.8970 < 2.2e-16 ***
## room_type 2 439.35 3132.1 -11267 802.5059 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [2] ""
## [3] "Residual standard error: 0.5232 on 9837 degrees of freedom"
## [4] "Multiple R-squared: 0.549,\tAdjusted R-squared: 0.5473 "
## [5] "F-statistic: 323.7 on 37 and 9837 DF, p-value: < 2.2e-16"
## [6] ""
Based on the results, the variables with the most significant effects on the log-transformed price, as indicated by the F-values and p-values, are the following. accommodates: the strongest single-degree-of-freedom predictor, with an extremely high F-value, reflecting its critical role in determining pricing; room_type: also highly significant, this variable captures substantial differences in pricing between types of accommodations, such as entire homes, private rooms, and shared rooms; availability_30; minimum_nights; latitude and longitude; host_is_superhost.
Other variables, such as property_type, bathrooms, and number_of_reviews_ltm, also contribute to the model, though their effects are relatively smaller. Meanwhile, review_scores_rating does not seem to have an impact on pricing.
The adjusted R-squared for this model is approximately 0.55, indicating that about 55% of the variability in the log-transformed price is explained by the predictors included in the model. The residual standard error (RSE) suggests that the average deviation of observed prices from the predicted values is moderate, reflecting a reasonably good fit.
Based on the coefficient calculations, several predictors stand out as highly relevant in explaining the variability in log-transformed accommodation prices.
- Accommodates: This variable has a strong positive relationship with price, indicating that properties designed to host more guests are generally priced higher.
- Room Type: The type of room plays a crucial role in pricing. Private rooms tend to be priced lower than entire homes, while shared rooms show an even more significant reduction in price.
- Location (Latitude and Longitude): These variables capture spatial pricing trends, emphasizing the importance of location. Latitude has a negative impact and longitude has a positive effect, with prices increasing as one moves east.
- Availability (30 days): Short-term availability has a notable positive effect on pricing.
- Host_is_superhost: Being a superhost positively impacts prices.
- Minimum Nights: Properties requiring longer minimum stays tend to have slightly lower prices.
- Property Type: Boutique hotels and certain unique property types exhibit higher pricing, while other property types, such as guesthouses, show lower price levels.

While the coefficients in the model are calculated in log-transformed terms, they can be interpreted in their original scale by applying an exponential transformation.
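The back-transformation can be illustrated with a short calculation. For a log-response model, a coefficient β corresponds to a multiplicative change of exp(β) in price per unit of the predictor, or (exp(β) - 1) × 100 percent. The two coefficients below are taken from the numeric-only log model output shown earlier, not from the final model, and the helper name `pct_effect` is hypothetical.

```r
# Percent change in price for a one-unit increase in a predictor,
# given its coefficient from a model with log(price) as response.
pct_effect <- function(beta) (exp(beta) - 1) * 100

# Coefficients from the numeric-only log(price) model printed earlier.
beta_accommodates   <- 0.2404      # accommodates
beta_minimum_nights <- -0.004853   # minimum_nights

pct_accommodates   <- pct_effect(beta_accommodates)
pct_minimum_nights <- pct_effect(beta_minimum_nights)

round(c(accommodates = pct_accommodates,
        minimum_nights = pct_minimum_nights), 2)
```

Under that model, each additional guest is associated with a price roughly 27% higher, holding the other predictors fixed, while each extra required night lowers it by about half a percent.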
R² (R-squared): Value: approximately 0.55 (from the summary). Meaning: The model explains approximately 55% of the variance in the log(price) variable.
Adjusted R²: Value: approximately 0.55 (from the summary). Meaning: Adjusted R² accounts for the number of predictors in the model, providing a more realistic measure when comparing models with different numbers of predictors.
Residual Standard Error (RSE): Value: approx. 0.52. Meaning: On average, the residuals deviate by about 0.52 units from the predicted values of log(price). This is a measure of the model’s overall error.
Mean Absolute Error (MAE): 0.3890596.
After fitting a baseline linear model with occupancy_rate_30 as the response variable and the predictors suggested by the EDA, and after removing predictors that appear to have no effect on the response, the model lm_occupancy.2 is analysed.
## Single term deletions
##
## Model:
## occupancy_rate_30 ~ latitude + longitude + bathrooms + bedrooms +
## accommodates + beds + price + minimum_nights + review_scores_rating +
## neighbourhood
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 8354200 66713
## latitude 1 915 8355115 66712 1.0732 0.3002403
## longitude 1 763 8354963 66711 0.8950 0.3441545
## bathrooms 1 38613 8392814 66756 45.2959 1.789e-11 ***
## bedrooms 1 12027 8366228 66725 14.1090 0.0001735 ***
## accommodates 1 25788 8379988 66741 30.2506 3.892e-08 ***
## beds 1 8047 8362247 66720 9.4396 0.0021293 **
## price 1 196542 8550743 66940 230.5566 < 2.2e-16 ***
## minimum_nights 1 16408 8370609 66730 19.2480 1.160e-05 ***
## review_scores_rating 1 99273 8453473 66827 116.4533 < 2.2e-16 ***
## neighbourhood 65 91402 8445602 66690 1.6495 0.0008074 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [2] ""
## [3] "Residual standard error: 29.2 on 9800 degrees of freedom"
## [4] "Multiple R-squared: 0.05924,\tAdjusted R-squared: 0.05214 "
## [5] "F-statistic: 8.339 on 74 and 9800 DF, p-value: < 2.2e-16"
## [6] ""
According to this model, bathrooms, bedrooms, accommodates, beds, price, minimum_nights and review_scores_rating may have an effect on the response variable. Only one level of neighbourhood shows a very weak effect on occupancy_rate_30, and latitude and longitude do not appear to affect the response variable. Considering the performance, we remove these variables from the model.
In the new version, bathrooms, bedrooms, accommodates, beds, price, minimum_nights and review_scores_rating may have an effect on the response variable.
After checking collinearity, no GVIF values are above 5. The model does not seem to have collinear variables.
Evaluation model lm_occupancy.2
The model demonstrates low explanatory power based on the following metrics. RSE: about 29. R-squared and adjusted R-squared are very low: only around 5% to 6% of the variability in occupancy rates is explained by the model.
The following analysis demonstrates the application of a linear model (lm_occupancy.2) to predict occupancy rates for accommodation listings using a simulated dataset. The first plot compares observed and fitted values to evaluate the model’s ability to capture the relationship between occupancy rate and key predictors, such as accommodates and price. The second plot visualizes residuals to assess model accuracy and highlight potential discrepancies.
Generalised Linear Model: the Poisson model is an extension of the linear model that deals with a particular type of data: count data. The Poisson model assumes a Poisson distribution of the data and uses the natural logarithm as a “link” function.
Therefore, for the Poisson GLM the response variable is availability_30.
Defining continuous and categorical variables: we continue with the variables from the linear model and keep the categorical variables converted to factors: host_is_superhost, room_type, property_type.
Based on the data, the first graph shows availability by host_is_superhost: this plot shows how availability (availability_30) varies between hosts who are superhosts and those who are not. The difference in medians is small, indicating that superhost status may not strongly influence availability. The range of availability is slightly larger for the “Non-Superhost” category, suggesting greater variability in this group.
The second graph illustrates availability by room type: this plot compares availability (availability_30) across the different room types (entire home/apt, private room, shared room).
After fitting the Poisson model (glm_basic) and analysing the results, overdispersion is identified, which is a common issue when modelling count data. The overdispersion of this model, calculated as residual deviance / residual degrees of freedom, is around 9. To address the overdispersion in the Poisson model, a quasi-Poisson model will be fitted later.
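The overdispersion check described above can be sketched as a single ratio. The report does not print the deviance of glm_basic, so the sketch below (in Python, for illustration) uses the residual deviance and degrees of freedom that appear in the quasi-Poisson summary of the refitted model further down in this report; treating them as representative of glm_basic is an assumption.

```python
# Values copied from the quasi-Poisson summary in this report
# (assumed here to be representative of the basic Poisson fit as well).
residual_deviance = 92807.0
residual_df = 9841

# A ratio close to 1 is what a well-specified Poisson model would give;
# values well above 1 point to overdispersion.
dispersion_ratio = residual_deviance / residual_df
print(f"residual deviance / residual df = {dispersion_ratio:.2f}")
```

R's quasipoisson family reports a dispersion parameter estimated from the Pearson residuals instead, which is why the summary shows a slightly different value (about 8.84) than this deviance-based ratio.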
However, before fitting the quasi-Poisson model, new observations are simulated from glm_basic, because the simulate() function does not directly support quasi-Poisson models.
Based on the Poisson simulation, the first graph shows the simulated availability_30 by host_is_superhost: as in the observed data, the differences between the categories are not pronounced. The range and variability are also slightly greater for non-superhosts, as in the observed data.
The second graph shows the simulated availability_30 for each room type. The simulated data trends align well with the observed data: shared rooms have consistently lower availability, while entire apartments show greater variability. The similarity between the observed and simulated distributions suggests that the Poisson model is capturing the main patterns.
Fit the quasi-Poisson model. The output is displayed in shortened form because property_type has many levels.
Some predictors, such as host_listings_count, do not seem to affect the response variable. The remaining variables seem to have an effect on it. Several levels of the factor property_type do not seem to affect availability_30.
Therefore, let us check whether property_type as a whole has an effect on the response variable using anova(): fit a model without this factor and compare.
## Analysis of Deviance Table
##
## Model 1: availability_30 ~ price + host_listings_count + accommodates +
## minimum_nights + bathrooms + number_of_reviews_ltm + review_scores_rating +
## host_is_superhost + room_type + property_type + latitude +
## longitude
## Model 2: availability_30 ~ price + host_listings_count + accommodates +
## minimum_nights + bathrooms + number_of_reviews_ltm + review_scores_rating +
## host_is_superhost + room_type + latitude + longitude
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 9837 92774
## 2 9862 93275 -25 -501.41 2.2699 0.0002976 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model 1 (glm_quasipoisson) provides a significantly better explanation of the variability in availability_30. The variable property_type seems to play a relevant role in the model. Therefore, property_type is not removed from the model.
As a next step, the model is refitted after removing the variables that appear to have no effect on the response variable.
##
## Call:
## glm(formula = availability_30 ~ price + minimum_nights + bathrooms +
## number_of_reviews_ltm + review_scores_rating + host_is_superhost +
## room_type + property_type, family = quasipoisson(link = "log"),
## data = BCN_Accomm)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.429e+00 2.872e-01 4.975 6.63e-07 ***
## price 1.216e-03 8.283e-05 14.682 < 2e-16 ***
## minimum_nights 2.035e-03 3.527e-04 5.771 8.12e-09 ***
## bathrooms 7.348e-02 1.533e-02 4.792 1.68e-06 ***
## number_of_reviews_ltm -7.829e-03 7.936e-04 -9.865 < 2e-16 ***
## review_scores_rating -1.530e-03 2.796e-04 -5.472 4.56e-08 ***
## host_is_superhostt -1.063e-01 3.099e-02 -3.429 0.000608 ***
## room_typePrivate room 2.787e-01 2.363e-02 11.791 < 2e-16 ***
## room_typeShared room 5.031e-01 9.647e-02 5.215 1.88e-07 ***
## property_typeApartment 4.858e-01 2.865e-01 1.696 0.089929 .
## property_typeBarn 3.617e-01 6.155e-01 0.588 0.556778
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 8.837591)
##
## Null deviance: 99642 on 9874 degrees of freedom
## Residual deviance: 92807 on 9841 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 7
Positive Relationship: Bathrooms: An increase in the number of bathrooms has a strong positive effect on the response variable. Room Type (Private Room): Properties categorized as “Private Room” positively influence the response variable compared to the reference level. Room Type (Shared Room): Properties categorized as “Shared Room” have an even stronger positive impact compared with the reference level. Minimum Nights: A longer minimum stay requirement also has a positive association with the response variable, though the effect is smaller.
Negative Relationship: Number of Reviews (last twelve months): An increase in the number of recent reviews has a significant negative effect on the response variable. Review Scores Rating: Higher review scores are negatively associated with the response variable, though the effect is more subtle. Host is Superhost: Being a superhost negatively affects the response variable, indicating that this attribute may not always align with the outcome being measured.
(Further details on the exponential transformed coefficients can be found in the accompanying R script)
## Null Deviance: 99642.31
## Residual Deviance: 92774.07
## Deviance Explained: 6.892891 %
The model explains only a small portion of the variability in the response variable, not even 7%. The model’s overall explanatory power is very limited.
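The "deviance explained" figure above follows directly from the two deviances printed with it. A minimal Python sketch of that arithmetic, using only the reported values:

```python
null_deviance = 99642.31      # deviance of the intercept-only model
residual_deviance = 92774.07  # deviance of the fitted model

# Share of the null deviance that the predictors account for.
deviance_explained = 1.0 - residual_deviance / null_deviance
print(f"Deviance explained: {100 * deviance_explained:.2f} %")
```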
Three levels of room_type:
## Entire home/apt
## Private room
## Shared room
Hypothesis 1: private accommodation differs from shared accommodation, that is, Entire home/apt and Private room (private) taken together, compared with Shared room (shared), regarding the effect on availability_30 (quasi-Poisson model: glm_quasi_updated).
Hypothesis 2: private room differs from shared room regarding the effect on availability_30 (quasi-Poisson model: glm_quasi_updated).
Contrast matrix for the two hypotheses:
## Entire home/apt Private room Shared room
## privacy vs shared 0.5 0.5 -1
## private room vs shared room 0.0 1.0 -1
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: User-defined Contrasts
##
##
## Fit: glm(formula = availability_30 ~ price + minimum_nights + bathrooms +
## number_of_reviews_ltm + review_scores_rating + host_is_superhost +
## room_type + property_type, family = quasipoisson(link = "log"),
## data = BCN_Accomm)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## privacy vs shared == 0 -0.36376 0.09483 -3.836 0.00015 ***
## private room vs shared room == 0 -0.22443 0.09464 -2.371 0.02010 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
There is a clear significant difference between private accommodation (entire home/apt and private room) and shared accommodation (shared room) regarding availability_30. Private accommodations (Entire home/apt and Private room) have significantly lower availability compared to shared accommodations (Shared room).
There is also a significant difference between private room and shared room regarding availability_30. Private room has significantly lower availability compared to Shared room.
How an increase in number_of_reviews_ltm affects availability_30 according to the quasi-Poisson model:
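The contrast estimates reported above can be reproduced by hand from the room_type coefficients of the quasi-Poisson summary. The Python sketch below assumes that Entire home/apt is the reference level (coefficient 0) and uses the rounded coefficients printed in the summary, so the results agree with the glht output only up to rounding.

```python
# Log-scale room_type coefficients from the quasi-Poisson summary.
# Order: Entire home/apt (reference level, assumed 0), Private room, Shared room.
room_coefs = [0.0, 0.2787, 0.5031]

# Rows of the contrast matrix shown in the report.
contrasts = {
    "privacy vs shared": [0.5, 0.5, -1.0],
    "private room vs shared room": [0.0, 1.0, -1.0],
}

def contrast_estimate(weights, coefs):
    """Linear combination of coefficients defined by one contrast row."""
    return sum(w * b for w, b in zip(weights, coefs))

estimates = {name: contrast_estimate(w, room_coefs) for name, w in contrasts.items()}
for name, value in estimates.items():
    print(f"{name}: {value:.4f}")
```

Because the model uses a log link, a negative estimate means that the first group has lower expected availability than the second.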
## Effect of 10 additional reviews: 0.9247006
## Percentage change in availability: -7.529937 %
## Effect of 50 additional reviews: 0.676092
## Percentage change in availability: -32.3908 %
This reflects that listings with many reviews are more popular and tend to be booked more frequently.
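The two effects above come from exponentiating the number_of_reviews_ltm coefficient multiplied by the number of additional reviews. A short Python sketch of that calculation, using the rounded coefficient from the summary (so the last digits may differ slightly from the R output):

```python
import math

beta_reviews = -7.829e-03  # number_of_reviews_ltm coefficient (log scale)

def availability_multiplier(extra_reviews, beta=beta_reviews):
    """Multiplicative change in expected availability_30 for
    `extra_reviews` additional reviews, other predictors held fixed."""
    return math.exp(beta * extra_reviews)

for k in (10, 50):
    m = availability_multiplier(k)
    print(f"{k} additional reviews: x{m:.4f} ({(m - 1) * 100:.2f} %)")
```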
The aim is to predict the response variable (days of availability) for three specific accommodations using the predict() function.
Accommodation 1: Minimum Nights: 3, Bathrooms: 1, Reviews: 20, Rating: 90, Superhost: No, Room Type: Private Room, Property Type: Apartment
Accommodation 2: Minimum Nights: 5, Bathrooms: 2, Reviews: 50, Rating: 95, Superhost: No, Room Type: Entire Home/Apt, Property Type: House
Accommodation 3: Minimum Nights: 7, Bathrooms: 3, Reviews: 100, Rating: 85, Superhost: Yes, Room Type: Shared Room, Property Type: Boat
## Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): object is not a matrix
## Error in eval(expr, envir, enclos): object 'predicted_availability' not found
Interpretation of Predictions:
accommodation 1: For the first hypothetical listing, the predicted availability_30 is approximately 8 days. accommodation 2: For the second listing, the predicted availability_30 is approximately 5 days. accommodation 3: For the third listing, the predicted availability_30 is approximately 13 days.
Next, the predicted availability_30 values for the first six listings in the original dataset (BCN_Accomm), based on the fitted quasi-Poisson model (glm_quasi_updated):
## 1 2 3 4 5 6
## 8 8 7 26 8 9
Interpretation:
Listing 1: Predicted availability is 8 days. Listing 2: Predicted availability is 8 days. Listing 3: Predicted availability is 7 days. Listing 4: Predicted availability is 26 days. Listing 5: Predicted availability is 8 days. Listing 6: Predicted availability is 9 days. These predictions are on the response scale (availability_30) and represent the expected availability based on the predictors in the original dataset.
Comparison of the actual values of availability_30 with the predicted values from the model: in the table below, the predicted values tend to be higher than the actual values for listings with low availability and lower for listings with high availability.
## Actual Predicted
## 1 9 8
## 2 18 8
## 3 21 7
## 4 3 26
## 5 2 8
## 6 0 9
Visualization: histogram of predicted availability. The following graph shows that most of the predictions are concentrated in the range of 5-10 days.
Evaluation of the prediction: a) Residual analysis
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -35.00000 -7.00000 -2.00000 -0.00122 5.00000 24.00000
Positive Residuals: Underprediction (model predicts lower availability than actual). Negative Residuals: Overprediction (model predicts higher availability than actual).
Visualization Scatter Plot: Actual vs. Predicted
Points above the red line indicate underprediction. Points below the red line indicate overprediction.
Model performance metrics: a) Mean Absolute Error (MAE), the average absolute difference between predicted and actual values, and b) Root Mean Squared Error (RMSE), the square root of the average squared difference:
## Mean Absolute Error (MAE): 7
## Root Mean Squared Error (RMSE): 9
Lower MAE and RMSE generally indicate better accuracy. The model’s error metrics suggest low accuracy in its predictions.
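To make the residual and error definitions concrete, the Python sketch below applies them to the six listings from the actual vs. predicted table above. The MAE of 7 and RMSE of 9 reported by the model refer to the full dataset, so the values for these six rows are expected to differ.

```python
import math

# First six listings from the "Actual / Predicted" table in this report.
actual = [9, 18, 21, 3, 2, 0]
predicted = [8, 8, 7, 26, 8, 9]

# Residual = actual - predicted: positive means underprediction,
# negative means overprediction.
residuals = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(r) for r in residuals) / len(residuals)
rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))

print("Residuals:", residuals)
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}")
```

RMSE is never smaller than MAE, and the gap widens when a few residuals are very large, as with the fourth listing here.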
To explore predictions for price and occupancy rate, we can fit GLM classification models, treating the two response variables as multinomial. This approach requires classifying their values into categories.
We want to further explore what are the key factors influencing accommodation prices in Barcelona. In particular, we would like to analyse what price can be predicted according to the variables above listed.
To fit a multinomial model, the prices must be converted into ranges. In this case we divide them into five ordered categories:
‘very low’ >> between 0 and 50 euros,
‘low’ >> between 50 and 150 euros,
‘medium’ >> between 150 and 300 euros,
‘high’ >> between 300 and 500 euros,
‘very high’ >> between 500 and 1000 euros.
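The binning above can be sketched as a lookup against the upper bounds of the five ranges. The report does not say which side of each interval is closed, so the Python sketch below assumes right-closed intervals (a price of exactly 50 is ‘very low’) and returns None outside the 0 to 1000 euro range; both choices are assumptions.

```python
from bisect import bisect_left

upper_bounds = [50, 150, 300, 500, 1000]   # upper edge of each range, in euros
labels = ["very low", "low", "medium", "high", "very high"]

def price_category(price):
    """Map a nightly price to one of the five ordered categories.
    Assumes intervals of the form (lower, upper]; prices outside
    (0, 1000] are left unclassified."""
    if not 0 < price <= upper_bounds[-1]:
        return None
    return labels[bisect_left(upper_bounds, price)]

for p in (35, 50, 120, 450, 980):
    print(p, "->", price_category(p))
```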
## Price Category Distributions:
## Train.Var1 Train.Freq Test.Var1 Test.Freq
## 1 very low 2599 very low 1757
## 2 low 2552 low 1621
## 3 medium 546 medium 380
## 4 high 181 high 123
## 5 very high 64 very high 51
To test what is the probability that a property belongs to a specific price category according to the predictors, a generalized linear model is fitted. Since the response variable is ordinal (with an inherent order) the family ‘cumulative’ with link ‘logit’ is used.
## (Intercept):1 (Intercept):2 (Intercept):3
## -0.8035368 2.2753092 3.7521347
## (Intercept):4 bathrooms bedrooms
## 5.2232283 -2.3593679 -0.3506564
## accommodates beds latitude
## -5.6672584 2.1228761 1.0898612
## longitude review_scores_rating minimum_nights
## -0.7963321 0.3916568 30.0639506
## room_typePrivate room room_typeShared room neighbourhood
## 1.9761966 3.2723914 -0.4412099
The four intercepts in the summary are thresholds (cut-points) for the cumulative probabilities of the categories. The coefficients represent the change in the log-odds of being in or below a given category, compared to higher categories, for a one-unit increase in the predictor, assuming the other predictors remain constant. A negative coefficient indicates that an increase in the predictor decreases the likelihood of being in a lower price category. This happens, for example, for bathrooms, bedrooms, longitude and neighbourhood. The variable accommodates also plays a highly significant role: its coefficient of around -5.67 means that accommodating more guests drastically decreases the likelihood of being in a lower price category. This aligns with the fact that a bigger property tends to be more expensive. Another key factor is the number of minimum nights: a higher number of minimum nights increases the likelihood of being in lower price categories. This would reflect hosts’ policy of setting lower prices for longer stays. The neighborhood does not seem to affect the response variable, although a better model might give a different result, since prices are commonly known to vary by location.
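To illustrate how the thresholds and coefficients combine, the Python sketch below turns the four reported intercepts into category probabilities. It assumes the parametrisation described above, logit P(Y ≤ j) = θ_j + η, where η is the sum of coefficient times predictor value and is shared by all thresholds. The value of η used in the second call is hypothetical, and the scaling of the predictors in the fitted model is not known from the output.

```python
import math

# Thresholds (intercepts) reported in the model output.
thresholds = [-0.8035368, 2.2753092, 3.7521347, 5.2232283]
labels = ["very low", "low", "medium", "high", "very high"]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def category_probabilities(eta=0.0):
    """Category probabilities for a linear predictor contribution `eta`
    under logit P(Y <= j) = theta_j + eta."""
    cumulative = [logistic(t + eta) for t in thresholds] + [1.0]
    probs = [cumulative[0]] + [
        cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))
    ]
    return dict(zip(labels, probs))

print(category_probabilities(0.0))    # all predictor contributions at zero
print(category_probabilities(-2.0))   # hypothetical negative contribution
```

A negative η lowers every cumulative probability, shifting mass away from ‘very low’ towards the higher price categories, which is the direction of effect described for negative coefficients.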
It is now possible to calculate the predicted probabilities for each price category and evaluate how many predictions are correct or incorrect in comparison with the test set.
## Actual
## Predicted very low low medium high very high
## 1 1517 502 33 10 9
## 2 237 1105 305 81 23
## 3 3 12 25 17 13
## 4 0 1 11 4 4
## 5 0 1 6 11 2
According to the confusion matrix, most of the correct predictions concern the ‘very low’ and ‘low’ price categories, while the results for the other categories are poor.
To evaluate the performance of the model we consider precision (for each class, the share of listings predicted as that class that truly belong to it) and recall (for each class, the share of the true instances of that class that the model identifies).
## [1] "Average Precision: 0.404141439373797"
## [1] "Average Recall: 0.336521398092365"
The averages of the precision and recall metrics are around 40% and 34% respectively. This means that, averaged over the classes, about 40% of the predictions made for a class are correct and only about 34% of the true instances of a class are identified. In conclusion, the classification model is not performing well.
The following model aims to predict the occupancy rate over 30 days based on location, price, and the other factors listed above. To fit the multinomial model, the occupancy rate must be converted into ranges. Given the non-uniform distribution of the data points, we prefer a data-driven approach using quantile-based bins, which divide the data into categories with approximately the same number of observations. Since the occupancy rate values run between 0 and 100, we divide them into ‘low’, ‘medium’ and ‘high’ categories.
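The averaged precision and recall can be recomputed from the confusion matrix shown above (rows are predicted classes, columns are actual classes). The Python sketch below uses an unweighted average over the five classes, which reproduces the reported figures.

```python
# Confusion matrix from the report: rows = predicted, columns = actual.
# Class order: very low, low, medium, high, very high.
confusion = [
    [1517, 502, 33, 10, 9],
    [237, 1105, 305, 81, 23],
    [3, 12, 25, 17, 13],
    [0, 1, 11, 4, 4],
    [0, 1, 6, 11, 2],
]
n = len(confusion)

# Precision: correct predictions divided by all predictions of that class (row sum).
precision = [confusion[i][i] / sum(confusion[i]) for i in range(n)]
# Recall: correct predictions divided by all true instances of that class (column sum).
recall = [confusion[i][i] / sum(row[i] for row in confusion) for i in range(n)]

avg_precision = sum(precision) / n
avg_recall = sum(recall) / n
print(f"Average Precision: {avg_precision:.6f}")
print(f"Average Recall: {avg_recall:.6f}")
```

Because every class counts equally in this average, the poorly predicted rare classes (‘high’ and ‘very high’) pull both figures down considerably.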
##
## Distribution into Train Set:
## < table of extent 0 >
##
## Distribution into Test Set:
##
## low medium high
## 1433 1233 1266
To estimate the probability that a property has a specific occupancy rate in the next 30 days depending on the other variables, the family ‘cumulative’ with link ‘logit’ is used in this case too, since the response variable is ordinal (with an inherent order).
## (Intercept):1 (Intercept):2 bathrooms
## -0.93536753 0.42326751 1.88802490
## bedrooms accommodates beds
## -1.33954306 -0.20559357 1.00581739
## latitude longitude review_scores_rating
## -0.02823400 0.03204691 -0.04403345
## minimum_nights price neighbourhood
## -0.42583334 2.97849898 -0.70877217
The summary of the model provides two intercepts, which should be interpreted respectively as the thresholds (on the logit scale) between low and medium/high, and between low/medium and high. Coefficients with a negative sign indicate that as their values increase, the odds of being in a higher category of occupancy rate (medium or high) decrease. According to the results, the more bedrooms, guests accommodated, and minimum nights, the lower the probability that the property is in a higher class of occupancy rate, possibly due to pricing, target market or other factors. This result might align with the idea that the bigger the property, the lower the probability of it being booked in the next 30 days. The positive coefficients for bathrooms and beds indicate that properties with more bathrooms and beds are more likely to have a higher occupancy rate. These features might increase the property’s convenience and attractiveness for larger groups, leading to higher demand and occupancy. Comparing, for example, the negative coefficient of bedrooms with the positive coefficient of beds, one could argue that groups of visitors prefer smaller properties with shared spaces over bigger accommodations. The positive coefficient for price is interpreted as follows: for every unit increase in price (each additional euro), the odds of being in a higher occupancy category increase by around 0.3%. Since this effect is close to 0 and one euro is a small change in real prices, we can conclude that price has a modest impact on the likelihood of a higher occupancy rate category.
We want to get the predicted probabilities for each occupancy rate category and evaluate how many predictions from the model are correct or incorrect in comparison with the test set.
## predicted_categories
## high low
## 2030 1902
The table of predicted categories shows that the model never predicts the class ‘medium’.
## low medium high
## 2 0.2946401 0.32444848 0.38091140
## 4 0.8984865 0.07329506 0.02821846
## 7 0.3090546 0.32602989 0.36491554
## 12 0.2625250 0.31819973 0.41927523
## 14 0.3526358 0.32679599 0.32056817
## 15 0.3250433 0.32698003 0.34797671
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07129 0.32497 0.32638 0.32025 0.32695 0.32717
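Inspecting the predicted probabilities, the values for the intermediate category are almost always close to 0.33 and, according to the summary above, never exceed about 0.327. Because the largest of three probabilities must be at least one third, the class ‘medium’ can never be the most probable one, so the model effectively ignores this class and returns predictions only for the others.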
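The effect can be checked on the six rows of predicted probabilities printed above: picking the most probable class for each row never selects ‘medium’. A minimal Python sketch, using only those six rows:

```python
labels = ["low", "medium", "high"]

# Predicted probabilities for the six observations shown in the report
# (keys are the row labels from the output).
probs = {
    2: [0.2946401, 0.32444848, 0.38091140],
    4: [0.8984865, 0.07329506, 0.02821846],
    7: [0.3090546, 0.32602989, 0.36491554],
    12: [0.2625250, 0.31819973, 0.41927523],
    14: [0.3526358, 0.32679599, 0.32056817],
    15: [0.3250433, 0.32698003, 0.34797671],
}

# The predicted class is the one with the highest probability.
predicted = {
    obs: labels[max(range(len(p)), key=p.__getitem__)] for obs, p in probs.items()
}
print(predicted)
```

Even in rows 14 and 15, where the three probabilities are nearly equal, one of the outer classes is marginally ahead of ‘medium’.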
## Confusion Matrix and Statistics
##
## Reference
## Prediction low medium high
## low 783 642 477
## medium 0 0 0
## high 650 591 789
##
## Overall Statistics
##
## Accuracy : 0.3998
## 95% CI : (0.3844, 0.4153)
## No Information Rate : 0.3644
## P-Value [Acc > NIR] : 2.537e-06
##
## Kappa : 0.0871
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: low Class: medium Class: high
## Sensitivity 0.5464 0.0000 0.6232
## Specificity 0.5522 1.0000 0.5345
## Pos Pred Value 0.4117 NaN 0.3887
## Neg Pred Value 0.6798 0.6864 0.7492
## Prevalence 0.3644 0.3136 0.3220
## Detection Rate 0.1991 0.0000 0.2007
## Detection Prevalence 0.4837 0.0000 0.5163
## Balanced Accuracy 0.5493 0.5000 0.5789
The accuracy of the model is around 40%, meaning that only about 40% of the predictions are correct, which is not a satisfactory result. To evaluate the performance of the model further, we consider precision and recall.
## [1] "Average Precision: 0.400170937514569"
## [1] "Average Recall: 0.389876296592727"
The precision of the model is around 40%, similar to the accuracy reported with the confusion matrix. The recall value shows that around 39% of the true instances are correctly identified. In conclusion, the classification model is not performing very well, as was also the case for the GLM classification model for price prediction.
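The accuracy and the averaged metrics can be recomputed from the confusion matrix above. Since ‘medium’ is never predicted, its precision is undefined (NaN in the output); the Python sketch below assumes that the reported average precision simply skips that class, which reproduces the printed value.

```python
# Confusion matrix from the report: rows = predicted, columns = actual (reference).
# Class order: low, medium, high.
confusion = [
    [783, 642, 477],
    [0, 0, 0],
    [650, 591, 789],
]
n = len(confusion)
total = sum(sum(row) for row in confusion)

accuracy = sum(confusion[i][i] for i in range(n)) / total

# Precision is undefined (None) for a class that is never predicted.
precision = [
    confusion[i][i] / sum(confusion[i]) if sum(confusion[i]) > 0 else None
    for i in range(n)
]
recall = [confusion[i][i] / sum(row[i] for row in confusion) for i in range(n)]

defined = [p for p in precision if p is not None]
avg_precision = sum(defined) / len(defined)   # assumption: undefined class skipped
avg_recall = sum(recall) / n

print(f"Accuracy: {accuracy:.4f}")
print(f"Average Precision: {avg_precision:.6f}")
print(f"Average Recall: {avg_recall:.6f}")
```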
In this chapter, Generalized Additive Models (GAMs) will be applied with the Price variable as the response, to analyze its interactions with the predictor variables.
The first goal is to identify key factors influencing prices in Barcelona, addressing Research Question 1 (What are the key factors influencing accommodation prices in Barcelona?)
First, we aim to determine whether a nonlinear relationship exists between the independent variables and price. To explore this, the variable Review Score Rating will be plotted against Price to visualize whether the relationship is linear or not.
From the plot above and Chapter 4.3 of this report, we observe that many points are concentrated on the right side, where higher review scores are paired with lower prices. This suggests a lack of a strong relationship between Review Score Ratings and Price.
Given that at least one variable does not exhibit a linear relationship, we will proceed with applying a Generalized Additive Model (GAM) to better capture potential nonlinear interactions.
The GAM is fitted using the training data.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## price ~ s(bathrooms) + s(bedrooms) + s(accommodates) + s(beds) +
## s(latitude) + s(longitude) + s(review_scores_rating) + s(minimum_nights) +
## room_type + neighbourhood
##
## Parametric coefficients:
## Estimate Std. Error t value
## (Intercept) 174.428 53.585 3.255
## room_typePrivate room -54.549 3.773 -14.456
## room_typeShared room -79.789 12.466 -6.400
## neighbourhoodCamp d'en Grassot i Gràcia Nova -67.663 55.026 -1.230
## neighbourhoodCan Baro -44.474 62.204 -0.715
## neighbourhoodCarmel -58.667 55.395 -1.059
## neighbourhoodCiutat Vella -54.459 54.607 -0.997
## neighbourhoodDiagonal Mar - La Mar Bella -13.863 57.198 -0.242
## neighbourhoodDreta de l'Eixample -26.482 54.019 -0.490
## neighbourhoodEixample -47.034 53.738 -0.875
## neighbourhoodEl Baix Guinardó -47.524 55.788 -0.852
## neighbourhoodEl Besòs i el Maresme -67.817 56.913 -1.192
## neighbourhoodEl Bon Pastor -78.193 67.115 -1.165
## neighbourhoodEl Born -62.377 56.385 -1.106
## neighbourhoodEl Camp de l'Arpa del Clot -64.039 55.512 -1.154
## neighbourhoodEl Clot -46.368 57.012 -0.813
## neighbourhoodEl Coll -99.812 80.982 -1.233
## neighbourhoodEl Congrés i els Indians -73.326 57.555 -1.274
## neighbourhoodel Fort Pienc -53.375 55.087 -0.969
## neighbourhoodEl Gòtic -51.478 55.038 -0.935
## neighbourhoodEl Poble-sec -63.246 54.885 -1.152
## neighbourhoodEl Poblenou -55.890 55.754 -1.002
## neighbourhoodEl Putget i Farró 33.413 54.681 0.611
## neighbourhoodEl Raval -56.553 54.594 -1.036
## neighbourhoodGlòries - El Parc -74.015 55.891 -1.324
## neighbourhoodGràcia -37.853 53.329 -0.710
## neighbourhoodGuinardó -2.162 55.442 -0.039
## neighbourhoodHorta -36.868 72.911 -0.506
## neighbourhoodHorta-Guinardó -43.272 52.805 -0.819
## neighbourhoodL'Antiga Esquerra de l'Eixample -47.718 53.980 -0.884
## neighbourhoodLa Barceloneta -49.809 56.159 -0.887
## neighbourhoodLa Font d'en Fargues -65.480 80.839 -0.810
## neighbourhoodLa Maternitat i Sant Ramon -80.454 55.530 -1.449
## neighbourhoodLa Nova Esquerra de l'Eixample -53.857 54.227 -0.993
## neighbourhoodLa Prosperitat -11.584 103.804 -0.112
## neighbourhoodLa Sagrada Família -61.659 54.290 -1.136
## neighbourhoodLa Sagrera -39.724 59.062 -0.673
## neighbourhoodLa Salut -59.806 56.479 -1.059
## neighbourhoodLa Teixonera -94.493 68.290 -1.384
## neighbourhoodLa Trinitat Vella -87.274 84.294 -1.035
## neighbourhoodLa Verneda i La Pau -73.838 63.074 -1.171
## neighbourhoodLa Vila Olímpica -36.563 58.240 -0.628
## neighbourhoodLes Corts -60.479 53.881 -1.122
## neighbourhoodLes Tres Torres -49.452 63.813 -0.775
## neighbourhoodMontbau -39.033 72.183 -0.541
## neighbourhoodNavas -65.133 57.272 -1.137
## neighbourhoodNou Barris -51.330 54.858 -0.936
## neighbourhoodPedralbes -13.899 65.834 -0.211
## neighbourhoodPorta -46.894 81.671 -0.574
## neighbourhoodProvençals del Poblenou -53.560 59.460 -0.901
## neighbourhoodSant Andreu -61.706 54.465 -1.133
## neighbourhoodSant Andreu de Palomar -50.605 58.877 -0.860
## neighbourhoodSant Antoni -51.243 54.544 -0.939
## neighbourhoodSant Genís dels Agudells -22.436 67.365 -0.333
## neighbourhoodSant Gervasi - Galvany -51.299 54.462 -0.942
## neighbourhoodSant Gervasi - la Bonanova -39.891 59.974 -0.665
## neighbourhoodSant Martí -56.760 54.552 -1.040
## neighbourhoodSant Martí de Provençals -63.418 57.789 -1.097
## neighbourhoodSant Pere/Santa Caterina -54.769 55.116 -0.994
## neighbourhoodSants-Montjuïc -58.904 53.901 -1.093
## neighbourhoodSarrià -43.530 57.619 -0.755
## neighbourhoodSarrià-Sant Gervasi -31.073 53.380 -0.582
## neighbourhoodTrinitat Nova -34.386 83.495 -0.412
## neighbourhoodTuró de la Peira - Can Peguera -38.170 72.901 -0.524
## neighbourhoodVallcarca i els Penitents -14.720 55.757 -0.264
## neighbourhoodVerdum - Los Roquetes -57.912 75.111 -0.771
## neighbourhoodVila de Gràcia -36.911 53.616 -0.688
## neighbourhoodVilapicina i la Torre Llobeta -50.991 59.803 -0.853
## Pr(>|t|)
## (Intercept) 0.00114 **
## room_typePrivate room < 2e-16 ***
## room_typeShared room 1.67e-10 ***
## neighbourhoodCamp d'en Grassot i Gràcia Nova 0.21887
## neighbourhoodCan Baro 0.47465
## neighbourhoodCarmel 0.28962
## neighbourhoodCiutat Vella 0.31867
## neighbourhoodDiagonal Mar - La Mar Bella 0.80850
## neighbourhoodDreta de l'Eixample 0.62398
## neighbourhoodEixample 0.38148
## neighbourhoodEl Baix Guinardó 0.39432
## neighbourhoodEl Besòs i el Maresme 0.23347
## neighbourhoodEl Bon Pastor 0.24404
## neighbourhoodEl Born 0.26865
## neighbourhoodEl Camp de l'Arpa del Clot 0.24871
## neighbourhoodEl Clot 0.41607
## neighbourhoodEl Coll 0.21781
## neighbourhoodEl Congrés i els Indians 0.20271
## neighbourhoodel Fort Pienc 0.33262
## neighbourhoodEl Gòtic 0.34966
## neighbourhoodEl Poble-sec 0.24923
## neighbourhoodEl Poblenou 0.31617
## neighbourhoodEl Putget i Farró 0.54119
## neighbourhoodEl Raval 0.30030
## neighbourhoodGlòries - El Parc 0.18547
## neighbourhoodGràcia 0.47785
## neighbourhoodGuinardó 0.96890
## neighbourhoodHorta 0.61311
## neighbourhoodHorta-Guinardó 0.41255
## neighbourhoodL'Antiga Esquerra de l'Eixample 0.37674
## neighbourhoodLa Barceloneta 0.37515
## neighbourhoodLa Font d'en Fargues 0.41797
## neighbourhoodLa Maternitat i Sant Ramon 0.14744
## neighbourhoodLa Nova Esquerra de l'Eixample 0.32066
## neighbourhoodLa Prosperitat 0.91115
## neighbourhoodLa Sagrada Família 0.25612
## neighbourhoodLa Sagrera 0.50124
## neighbourhoodLa Salut 0.28969
## neighbourhoodLa Teixonera 0.16650
## neighbourhoodLa Trinitat Vella 0.30055
## neighbourhoodLa Verneda i La Pau 0.24179
## neighbourhoodLa Vila Olímpica 0.53017
## neighbourhoodLes Corts 0.26171
## neighbourhoodLes Tres Torres 0.43840
## neighbourhoodMontbau 0.58870
## neighbourhoodNavas 0.25547
## neighbourhoodNou Barris 0.34947
## neighbourhoodPedralbes 0.83280
## neighbourhoodPorta 0.56586
## neighbourhoodProvençals del Poblenou 0.36774
## neighbourhoodSant Andreu 0.25728
## neighbourhoodSant Andreu de Palomar 0.39010
## neighbourhoodSant Antoni 0.34753
## neighbourhoodSant Genís dels Agudells 0.73910
## neighbourhoodSant Gervasi - Galvany 0.34627
## neighbourhoodSant Gervasi - la Bonanova 0.50599
## neighbourhoodSant Martí 0.29816
## neighbourhoodSant Martí de Provençals 0.27251
## neighbourhoodSant Pere/Santa Caterina 0.32041
## neighbourhoodSants-Montjuïc 0.27452
## neighbourhoodSarrià 0.44998
## neighbourhoodSarrià-Sant Gervasi 0.56052
## neighbourhoodTrinitat Nova 0.68047
## neighbourhoodTuró de la Peira - Can Peguera 0.60058
## neighbourhoodVallcarca i els Penitents 0.79179
## neighbourhoodVerdum - Los Roquetes 0.44072
## neighbourhoodVila de Gràcia 0.49121
## neighbourhoodVilapicina i la Torre Llobeta 0.39389
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(bathrooms) 4.288 5.251 11.844 < 2e-16 ***
## s(bedrooms) 5.078 5.968 4.426 0.000178 ***
## s(accommodates) 2.155 2.811 23.342 < 2e-16 ***
## s(beds) 5.157 6.011 3.090 0.005113 **
## s(latitude) 7.387 8.429 6.114 < 2e-16 ***
## s(longitude) 1.011 1.022 4.651 0.030939 *
## s(review_scores_rating) 4.113 4.990 17.244 < 2e-16 ***
## s(minimum_nights) 5.147 5.957 46.977 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.335 Deviance explained = 34.6%
## -REML = 34744 Scale est. = 7667.3 n = 5943
The model explains 34.6% of the deviance (adjusted R² of 0.335), indicating moderate explanatory power with room for improvement.
All smooth terms are statistically significant. Bathrooms, guest capacity (accommodates), and review scores are among the strongest predictors, with higher values associated with higher prices, and minimum nights has the largest F statistic of all smooth terms. The spatial variables latitude and longitude are also significant, although longitude only at the 5% level. Among the categorical predictors, room type has a significant impact, with private rooms and shared rooms reducing prices compared to entire homes/apartments, whereas the individual neighborhoods do not show strong statistical significance.
## R-squared on Test Set: 0.3279427
The plot above compares the observed prices against the predicted values generated by the model. It can be observed that the points deviate more from the red line at higher price ranges, indicating that the model performs better for lower price ranges. This behavior can be attributed to the presence of outliers, which may influence the model’s ability to accurately predict higher prices.
## MAE: 45.69611
## RMSE: 91.81648
## R-squared: 0.3279427
The results above provide key evaluation metrics for the GAM used to predict Airbnb prices.
Root Mean Square Error (RMSE): The RMSE is 91.82, roughly twice the MAE of 45.70, suggesting that there are some outliers or large deviations between observed and predicted prices, especially for higher-priced Airbnb listings.
These metrics suggest that while the GAM provides a good fit for lower-priced listings, its predictions are less accurate for higher prices, likely due to the influence of outliers.
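As a minimal sketch of how the three metrics reported above could be computed, the helper below uses base R only. The function name `eval_metrics` and the toy vectors are hypothetical, and R-squared is taken here as 1 - SSE/SST, which is an assumption since the report does not state which definition was used.

```r
# Hypothetical helper: MAE, RMSE and R-squared for observed vs predicted values
eval_metrics <- function(observed, predicted) {
  err <- observed - predicted
  sse <- sum(err^2)                           # sum of squared errors
  sst <- sum((observed - mean(observed))^2)   # total sum of squares
  c(MAE = mean(abs(err)), RMSE = sqrt(mean(err^2)), R2 = 1 - sse / sst)
}

# Toy data (not from the Airbnb dataset): the most expensive listing
# is badly underpredicted
observed  <- c(50, 80, 120, 600)
predicted <- c(55, 75, 110, 300)
m <- eval_metrics(observed, predicted)
print(round(m, 3))
```

In this toy example a single large error on the most expensive listing pulls the RMSE well above the MAE, which is the pattern the text attributes to high-priced listings.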
##
## Family: gaussian
## Link function: identity
##
## Formula:
## occupancy_rate_30 ~ s(latitude) + s(longitude) + s(bathrooms) +
## s(bedrooms) + s(accommodates) + s(beds) + s(price) + s(minimum_nights) +
## s(review_scores_rating) + (neighbourhood)
##
## Parametric coefficients:
## Estimate Std. Error t value
## (Intercept) 66.84643 20.78582 3.216
## neighbourhoodCamp d'en Grassot i Gràcia Nova 15.29053 21.21276 0.721
## neighbourhoodCan Baro -4.89359 23.39917 -0.209
## neighbourhoodCarmel 18.05269 21.67874 0.833
## neighbourhoodCiutat Vella 5.17335 21.00320 0.246
## neighbourhoodDiagonal Mar - La Mar Bella 2.52654 21.81608 0.116
## neighbourhoodDreta de l'Eixample 7.36383 20.90720 0.352
## neighbourhoodEixample 8.34876 20.84609 0.400
## neighbourhoodEl Baix Guinardó 1.40899 21.42145 0.066
## neighbourhoodEl Besòs i el Maresme 7.56102 21.84241 0.346
## neighbourhoodEl Bon Pastor 2.72680 24.72421 0.110
## neighbourhoodEl Born 3.47964 21.47404 0.162
## neighbourhoodEl Camp de l'Arpa del Clot -7.33206 21.27541 -0.345
## neighbourhoodEl Clot 4.96020 21.67989 0.229
## neighbourhoodEl Coll 29.51129 29.14518 1.013
## neighbourhoodEl Congrés i els Indians 12.58958 22.13662 0.569
## neighbourhoodel Fort Pienc 9.86991 21.17010 0.466
## neighbourhoodEl Gòtic 4.51670 21.12756 0.214
## neighbourhoodEl Poble-sec 2.46171 21.16763 0.116
## neighbourhoodEl Poblenou 7.61134 21.36080 0.356
## neighbourhoodEl Putget i Farró -1.04808 21.22717 -0.049
## neighbourhoodEl Raval 9.21152 21.04325 0.438
## neighbourhoodGlòries - El Parc -7.58301 21.37555 -0.355
## neighbourhoodGràcia 9.80653 20.81584 0.471
## neighbourhoodGuinardó 7.44524 21.51030 0.346
## neighbourhoodHorta -8.48519 26.64170 -0.318
## neighbourhoodHorta-Guinardó 12.62002 20.85610 0.605
## neighbourhoodL'Antiga Esquerra de l'Eixample 3.58489 20.97582 0.171
## neighbourhoodLa Barceloneta 3.02710 21.40393 0.141
## neighbourhoodLa Font d'en Fargues 14.32327 29.08695 0.492
## neighbourhoodLa Maternitat i Sant Ramon 4.17613 20.55349 0.203
## neighbourhoodLa Nova Esquerra de l'Eixample 10.01991 21.00567 0.477
## neighbourhoodLa Prosperitat -34.74147 35.67909 -0.974
## neighbourhoodLa Sagrada Família 9.39787 20.95887 0.448
## neighbourhoodLa Sagrera -3.50231 22.54129 -0.155
## neighbourhoodLa Salut 9.59270 21.70711 0.442
## neighbourhoodLa Teixonera 22.47110 25.32633 0.887
## neighbourhoodLa Trinitat Vella 28.12841 29.53684 0.952
## neighbourhoodLa Verneda i La Pau 0.01125 23.67120 0.000
## neighbourhoodLa Vila Olímpica -0.47263 22.03475 -0.021
## neighbourhoodLes Corts 5.11209 20.65352 0.248
## neighbourhoodLes Tres Torres 14.64719 23.80717 0.615
## neighbourhoodMontbau 10.58624 26.57013 0.398
## neighbourhoodNavas -12.92257 21.88357 -0.591
## neighbourhoodNou Barris -7.61793 21.23428 -0.359
## neighbourhoodPedralbes -14.86333 22.84037 -0.651
## neighbourhoodPorta -58.95588 29.15289 -2.022
## neighbourhoodProvençals del Poblenou 11.87172 22.41982 0.530
## neighbourhoodSant Andreu -0.08328 21.21798 -0.004
## neighbourhoodSant Andreu de Palomar 13.64181 22.43753 0.608
## neighbourhoodSant Antoni 6.88885 21.06979 0.327
## neighbourhoodSant Genís dels Agudells 13.18838 25.17867 0.524
## neighbourhoodSant Gervasi - Galvany -1.27434 21.16764 -0.060
## neighbourhoodSant Gervasi - la Bonanova -21.49174 22.60517 -0.951
## neighbourhoodSant Martí 2.45604 21.00509 0.117
## neighbourhoodSant Martí de Provençals 1.64003 22.03713 0.074
## neighbourhoodSant Pere/Santa Caterina 5.83101 21.12686 0.276
## neighbourhoodSants-Montjuïc 5.55296 20.92918 0.265
## neighbourhoodSarrià 7.92799 20.33856 0.390
## neighbourhoodSarrià-Sant Gervasi 0.94816 20.75466 0.046
## neighbourhoodTrinitat Nova -1.81837 29.37886 -0.062
## neighbourhoodTuró de la Peira - Can Peguera 6.57138 26.63044 0.247
## neighbourhoodVallcarca i els Penitents 8.77051 21.55017 0.407
## neighbourhoodVerdum - Los Roquetes 24.39287 26.86071 0.908
## neighbourhoodVila de Gràcia 8.68923 20.88628 0.416
## neighbourhoodVilapicina i la Torre Llobeta 9.94946 22.84562 0.436
## Pr(>|t|)
## (Intercept) 0.00131 **
## neighbourhoodCamp d'en Grassot i Gràcia Nova 0.47105
## neighbourhoodCan Baro 0.83435
## neighbourhoodCarmel 0.40503
## neighbourhoodCiutat Vella 0.80545
## neighbourhoodDiagonal Mar - La Mar Bella 0.90781
## neighbourhoodDreta de l'Eixample 0.72469
## neighbourhoodEixample 0.68881
## neighbourhoodEl Baix Guinardó 0.94756
## neighbourhoodEl Besòs i el Maresme 0.72923
## neighbourhoodEl Bon Pastor 0.91218
## neighbourhoodEl Born 0.87128
## neighbourhoodEl Camp de l'Arpa del Clot 0.73039
## neighbourhoodEl Clot 0.81904
## neighbourhoodEl Coll 0.31131
## neighbourhoodEl Congrés i els Indians 0.56957
## neighbourhoodel Fort Pienc 0.64108
## neighbourhoodEl Gòtic 0.83072
## neighbourhoodEl Poble-sec 0.90742
## neighbourhoodEl Poblenou 0.72161
## neighbourhoodEl Putget i Farró 0.96062
## neighbourhoodEl Raval 0.66159
## neighbourhoodGlòries - El Parc 0.72279
## neighbourhoodGràcia 0.63758
## neighbourhoodGuinardó 0.72926
## neighbourhoodHorta 0.75012
## neighbourhoodHorta-Guinardó 0.54514
## neighbourhoodL'Antiga Esquerra de l'Eixample 0.86430
## neighbourhoodLa Barceloneta 0.88754
## neighbourhoodLa Font d'en Fargues 0.62243
## neighbourhoodLa Maternitat i Sant Ramon 0.83900
## neighbourhoodLa Nova Esquerra de l'Eixample 0.63337
## neighbourhoodLa Prosperitat 0.33024
## neighbourhoodLa Sagrada Família 0.65388
## neighbourhoodLa Sagrera 0.87653
## neighbourhoodLa Salut 0.65857
## neighbourhoodLa Teixonera 0.37497
## neighbourhoodLa Trinitat Vella 0.34098
## neighbourhoodLa Verneda i La Pau 0.99962
## neighbourhoodLa Vila Olímpica 0.98289
## neighbourhoodLes Corts 0.80452
## neighbourhoodLes Tres Torres 0.53842
## neighbourhoodMontbau 0.69033
## neighbourhoodNavas 0.55487
## neighbourhoodNou Barris 0.71979
## neighbourhoodPedralbes 0.51523
## neighbourhoodPorta 0.04319 *
## neighbourhoodProvençals del Poblenou 0.59647
## neighbourhoodSant Andreu 0.99687
## neighbourhoodSant Andreu de Palomar 0.54322
## neighbourhoodSant Antoni 0.74371
## neighbourhoodSant Genís dels Agudells 0.60044
## neighbourhoodSant Gervasi - Galvany 0.95200
## neighbourhoodSant Gervasi - la Bonanova 0.34177
## neighbourhoodSant Martí 0.90692
## neighbourhoodSant Martí de Provençals 0.94068
## neighbourhoodSant Pere/Santa Caterina 0.78256
## neighbourhoodSants-Montjuïc 0.79077
## neighbourhoodSarrià 0.69670
## neighbourhoodSarrià-Sant Gervasi 0.96356
## neighbourhoodTrinitat Nova 0.95065
## neighbourhoodTuró de la Peira - Can Peguera 0.80510
## neighbourhoodVallcarca i els Penitents 0.68404
## neighbourhoodVerdum - Los Roquetes 0.36385
## neighbourhoodVila de Gràcia 0.67741
## neighbourhoodVilapicina i la Torre Llobeta 0.66321
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(latitude) 2.020 2.695 0.807 0.503167
## s(longitude) 5.694 7.000 2.304 0.024207 *
## s(bathrooms) 1.011 1.021 11.414 0.000688 ***
## s(bedrooms) 4.845 5.790 3.527 0.001735 **
## s(accommodates) 4.274 5.216 8.978 < 2e-16 ***
## s(beds) 2.393 3.030 1.410 0.238039
## s(price) 5.429 6.518 29.183 < 2e-16 ***
## s(minimum_nights) 4.105 4.790 8.485 1.3e-07 ***
## s(review_scores_rating) 2.431 2.918 18.879 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0723 Deviance explained = 8.75%
## -REML = 28234 Scale est. = 834.69 n = 5943
The GAM used to predict 30-day occupancy rates explains only about 7% of the variability in the data (adjusted R² of 0.0723), with a deviance explained of 8.75%. The results indicate that price, minimum nights, review scores rating, bathrooms, bedrooms, and accommodates are significant for determining occupancy rates, whereas the spatial variables are less so (longitude is significant only at the 5% level and latitude is not significant), as is the number of beds. The low R² value indicates that other factors not included in the dataset, such as seasonality, might play a significant role in determining occupancy rates.
## R-squared on Test Set (Occupancy 30): 0.05990545
## MAE: 64.63602
## RMSE: 66.07687
The MAE indicates that, on average, the model’s predictions deviate from the observed occupancy rates by approximately 64.64 percentage points.
The RMSE of 66.08 is only slightly above the MAE, which suggests that the errors are large across most listings rather than driven by a few outliers, and points to the need for an alternative model.
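One way to read the two error metrics together is through the identity RMSE² = MAE² + Var(|error|), where the variance is the population variance of the absolute errors. The sketch below applies it to the MAE and RMSE printed above and checks the identity on toy errors; the variable names and the toy vector are illustrative only.

```r
# Reported test-set metrics for the occupancy GAM (copied from the output above)
mae  <- 64.63602
rmse <- 66.07687

# Implied standard deviation of the absolute errors
sd_abs_err <- sqrt(rmse^2 - mae^2)
print(sd_abs_err)

# Check of the identity on toy errors (not from the dataset)
e     <- c(-3, 1, 4, -2, 6)
abs_e <- abs(e)
lhs   <- mean(e^2)                                     # RMSE squared
rhs   <- mean(abs_e)^2 + mean((abs_e - mean(abs_e))^2) # MAE squared + variance
print(c(lhs = lhs, rhs = rhs))
```

A spread of the absolute errors that is small relative to the MAE means that most predictions miss by a similar, large amount.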
A neural network with a single hidden layer of three nodes is trained on the normalized training data. The model uses a linear activation function for the output.
## Length Class Mode
## call 5 -none- call
## response 5943 -none- numeric
## covariate 47544 -none- numeric
## model.list 2 -none- list
## err.fct 1 -none- function
## act.fct 1 -none- function
## linear.output 1 -none- logical
## data 26 data.frame list
## exclude 0 -none- NULL
## net.result 1 -none- list
## weights 1 -none- list
## generalized.weights 1 -none- list
## startweights 1 -none- list
## result.matrix 34 -none- numeric
# Load the packages used in this chunk
library(dplyr)
library(neuralnet)
library(ggplot2)
# Predictor columns used by the neural network
nn_features <- c("bathrooms", "bedrooms", "accommodates", "beds",
                 "latitude", "longitude", "review_scores_rating",
                 "minimum_nights")
# Ensure relevant columns are numeric
test_normalized <- test_normalized %>%
  mutate(across(all_of(nn_features), as.numeric))
# Prediction using the neural network model
# (neuralnet::compute is called explicitly because dplyr also exports compute())
nn_predictions <- neuralnet::compute(nn_price, test_normalized[, nn_features])
# Denormalize the predictions
denormalize <- function(x, original_min, original_max) {
  x * (original_max - original_min) + original_min
}
# net.result is a one-column matrix, so convert it to a plain numeric vector
test$predicted_price <- denormalize(as.numeric(nn_predictions$net.result),
                                    min(train$price),
                                    max(train$price))
# Create the Observed vs Predicted plot
nn_plot <- ggplot(test, aes(x = price, y = predicted_price)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  labs(title = "Neural Network Model: Observed vs Predicted Prices",
       x = "Observed Price", y = "Predicted Price") +
  theme_minimal()
# Save the plot as a PNG file
ggsave("nn_price_plot.png", plot = nn_plot, width = 8, height = 6)
# Include the saved plot in the knitted document
knitr::include_graphics("nn_price_plot.png")
Similar to the GAM model, the plot above shows that the points spread more significantly at higher price ranges, indicating that the neural network model is more accurate when predicting lower prices. This suggests that the model struggles to capture the variability in higher-priced listings, potentially due to outliers.
## MAE: 46.24316
## RMSE: 93.47576
## R-squared: 0.3034329
The R-squared value is 0.303, meaning that the model explains only about 30.3% of the variability in the prices.
Since the target variable is numeric, price is predicted with Support Vector Regression (SVR) rather than a classification model. We use the normalized numerical variables because the model is sensitive to the scale of the features. Moreover, as the response variable has a skewed distribution, we apply a log transformation to reduce the skew.
For the SVR model we choose a radial kernel to capture non-linear dependencies between the variables. Tuning the cost parameter is crucial to finding the balance between over-fitting and under-fitting. The cost parameter (C) controls the trade-off between keeping the regression function smooth and penalizing training points whose error exceeds the epsilon tolerance. (Disclaimer: the tuning takes very long to run, so it remains commented out in the code file for convenience.)
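To illustrate what the cost parameter weighs, the sketch below implements the epsilon-insensitive loss of the standard epsilon-SVR formulation in base R. The function names and the toy residuals are hypothetical; epsilon = 0.1 is chosen to match the value printed in the model summary.

```r
# Epsilon-insensitive loss: residuals inside the tube of half-width epsilon
# cost nothing, larger residuals are penalized linearly
eps_insensitive_loss <- function(residual, epsilon = 0.1) {
  pmax(abs(residual) - epsilon, 0)
}

# Toy residuals (hypothetical, on the scale of the normalized response)
residuals <- c(0.05, -0.30, 0.10, 0.80)
loss <- eps_insensitive_loss(residuals)

# The cost parameter scales the total penalty for points outside the tube
penalty <- function(cost) cost * sum(loss)

print(loss)
print(c(low_cost = penalty(1), high_cost = penalty(1e5)))
```

With a very large cost, the penalty term dominates the objective, so the fitted function bends to reduce training errors, which increases the risk of over-fitting.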
# Train the SVM model
SVM_model <- svm(formula_svm, data=train_normalized_encod, kernel = 'radial', scale = FALSE, cost = 100000, gamma = 0.05)
summary(SVM_model)
##
## Call:
## svm(formula = formula_svm, data = train_normalized_encod, kernel = "radial",
## cost = 1e+05, gamma = 0.05, scale = FALSE)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1e+05
## gamma: 0.05
## epsilon: 0.1
##
##
## Number of Support Vectors: 1181
The SVR model can be evaluated using the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE), and the R-squared.
## MAE: 0.05357919
## RMSE: 0.09703205
## R-squared: 0.2598941
A Mean Absolute Error (MAE) of around 0.05 means that, on average, the model’s predictions are off by about 0.05 units. These units are presumably those of the normalized, log-transformed response rather than currency, so the value cannot be compared directly with the MAE of the GAM or the neural network. The RMSE (0.097) is almost twice the MAE, which can indicate outliers or extreme errors that influence the result. Finally, the R-squared value indicates that the model explains only around 26% of the variance of the price, which is a weak performance.
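To make an error on a transformed scale easier to interpret, it can be mapped back to price units. The sketch below assumes that the response was log-transformed and then min-max scaled to [0, 1]; this pipeline, the bounds `log_min` and `log_max`, and the helper names are all assumptions for illustration, not values from the report.

```r
# Hypothetical training-set bounds of log(price); replace with the real values
log_min <- log(10)
log_max <- log(1000)

# Forward: log-transform, then min-max scale to [0, 1]
to_model_scale <- function(price) (log(price) - log_min) / (log_max - log_min)
# Backward: undo the scaling, then undo the log
to_price <- function(z) exp(z * (log_max - log_min) + log_min)

z_obs      <- to_model_scale(100)  # 0.5 on the model scale
z_pred     <- z_obs + 0.05         # prediction off by 0.05 model units
price_pred <- to_price(z_pred)
print(c(observed = 100, predicted = price_pred))
```

Under these assumed bounds, an error of 0.05 model units at a price of 100 corresponds to a difference of roughly 26% in price units, because errors on a log scale act multiplicatively.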
We also fit a Support Vector Regression model for the occupancy rate. As with the SVR price model, we use the normalized numerical variables and encode categorical variables as numerical values. Here too, we choose a radial kernel since the predictors have mainly non-linear relationships with the response variable. (Disclaimer: the tuning takes very long to run, so it remains commented out in the code file for convenience.)
# Train the SVM model
SVM_occupancy <- svm(formula_svr_occupancy, data=train_normalized_encod, kernel = 'radial', scale = FALSE, cost = 1000000, gamma = 0.1, type='eps-regression')
Here too, the model can be evaluated using the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE), and the R-squared.
## MAE: 22.29245
## RMSE: 30.90774
## R-squared: -0.06299629
A Mean Absolute Error (MAE) of around 22 means that, on average, the model’s predictions are off by about 22 units from the actual occupancy rate. Here too, an RMSE clearly larger than the MAE can indicate outliers or extreme errors that influence the result. Moreover, the negative R-squared value indicates that the model performs worse than a simple baseline that predicts the mean of the target variable for all observations. It cannot generalize or capture the variance in the data.
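The comparison with a mean-only baseline can be made concrete with a short base R sketch. It assumes R-squared is computed as 1 - SSE/SST; the function name and the toy occupancy values are hypothetical.

```r
# R-squared as 1 - SSE/SST (assumed definition)
r_squared <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

# Toy occupancy rates in percent (not from the dataset)
observed <- c(20, 40, 60, 80, 100)

# Baseline: always predict the mean of the observed values
baseline <- rep(mean(observed), length(observed))
# A model whose predictions cluster near the top of the range
clustered <- c(85, 84, 86, 83, 85)

r2_baseline  <- r_squared(observed, baseline)   # exactly 0 by construction
r2_clustered <- r_squared(observed, clustered)  # negative: worse than the baseline
print(c(baseline = r2_baseline, clustered = r2_clustered))
```

In practice the baseline would use the mean of the training set rather than of the test set, so its R-squared on the test set would be close to, but not exactly, zero.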
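Inspecting the distribution of the predictions shows that they are concentrated at the high end of the scale: the 1st quartile is 79.56 and the 3rd quartile is 88.81, while the minimum (-96.70) and the maximum (118.95) fall outside the 0 to 100 range of an occupancy rate expressed as a percentage.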
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -96.70 79.56 84.75 82.57 88.81 118.95
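Assuming the occupancy rate is a percentage between 0 and 100, predictions outside that range can be clamped before evaluation. This is a minimal sketch; the vector `pred` simply reuses the summary values printed above as illustrative inputs.

```r
# Illustrative predictions spanning the range shown in the summary above
pred <- c(-96.70, 79.56, 84.75, 88.81, 118.95)

# Clamp to the valid range of a percentage
pred_clamped <- pmin(pmax(pred, 0), 100)
print(pred_clamped)
```

Clamping removes impossible values but does not address the underlying lack of fit.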
Generative AI supported both the understanding of the models and the coding syntax. In the first case, it was useful for deciding what kind of model to fit (i.e. whether a classification or regression model would be more appropriate). GenAI also helped to speed up coding when testing different solutions or plotting results, and to understand syntax errors.
It has been a powerful tool for addressing conceptual doubts and clarifying key ideas, providing strong explanatory support. However, obtaining accurate results or aligning AI suggestions with the user’s intent can be challenging. It often requires detailed and precise questions, which can be time-consuming and not always efficient. Furthermore, all AI-generated responses must be carefully validated, as errors can still occur.
Assuming linear relationships, the linear model for price prediction was developed using variables selected on the basis of the earlier analysis. Key variables, such as accommodates and room_type, were identified as having a strong impact on price, as previously discussed. According to the metrics, approximately 55% of the variability in the log-transformed price is explained by the predictors included in the model, indicating moderate explanatory power. In the linear model for occupancy rate prediction, predictors such as price and review_scores_rating have a strong impact on the response variable. However, based on the model’s performance metrics, the model has very low explanatory power: only approximately 6% of the variability in occupancy rates is explained. Therefore, the predictions for new data are also unsatisfactory.
As regards the generalized linear model with a Poisson distribution, the model identifies notable relationships between the response variable availability_30 and accommodation type, accommodates, and number of reviews. However, based on the model’s performance metrics, the model has very low explanatory power, below 7%. In addition, the model’s error metrics suggest very low accuracy in its predictions.
The multinomial generalized linear models gave an accuracy below 50% for both the price and the occupancy rate categories, which cannot be considered good performance. However, the coefficients indicate that bathrooms, accommodates, latitude, and minimum_nights are strong determinants of the price category, while bathrooms, beds, bedrooms, accommodates, and minimum nights mainly influence the occupancy rate category.
The Generalized Additive Model for price prediction did not perform well, explaining only approx. 34% of the variable’s variance, while the same model for occupancy rate explained less than 9%, probably due to the influence of outliers.
The Support Vector Regression model also gave unsatisfactory results for price prediction and failed for occupancy rate prediction, where the R-squared was negative.
As concerns the Neural Network model for price prediction, the model explained only about 30% of the variable’s variance, while for the occupancy rate it was not possible to fit a model.
The low performance of the different models could be caused by the data distribution and the presence of outliers, which may have strongly affected some of the models.
Further improvements could be obtained with a more in-depth analysis of the data and a more selective approach to the variables.